16. Abstract

The areas of automatic speech recognition and speech synthesis are examined
to  ascertain  what  possibilities  may  exist  for  implementing  them  in  Coast  Guard 
Communication  Stations.  A  discussion  of  the  state  of  the  art  in  both  speech 
recognition  and  speech  synthesis  is  presented.  Concepts  from  the  disciplines 
are  given,  as  well  as  descriptions  of  many  commercially  available  devices. 

We  do  not  recommend  that  the  Coast  Guard  pursue  the  development  of  a  speech 
recognition  system  at  this  time.  Several  manufacturers  are  attempting  to  develop 
machines  that  will  meet  the  Coast  Guard's  minimum  requirements.  Quite  rapid 
progress  is  characteristic  of  the  speech  recognition  field  and  suitable  systems 
should  be  available  in  a  few  years. 

We do recommend that the Coast Guard research the implementation of speech
synthesis technology at this time for the following: specific tasks,
particularly weather reports, and general purpose use in Communication Stations.
EXECUTIVE SUMMARY*


1.1 GENERAL EVALUATION

Commercially available recognizers may not meet Coast
Guard minimum requirements, but today's synthesizers most
likely will be very useful for certain tasks, such as
weather reports.


1.2 RECOMMENDATION: SPEECH RECOGNITION

We do not recommend that the Coast Guard pursue the
development of a speech recognition system at this time.

RATIONALE: 

Current  technology  does  not  provide  equipment  capable  of 
recognizing  key  words,  such  as  "mayday",  as  found  in 
ordinary  and  expected  transmissions.  Several  manufacturers 
are  attempting  to  develop  machines  that  will  meet  the  Coast 
Guard's minimum requirements. Quite rapid progress is
characteristic of the voice recognition field and suitable
systems should be available in at most a few years. It is
highly doubtful that a duplicate research effort funded by
the Coast Guard could provide suitable voice recognition
units more quickly.


* A glossary of terms used in this report is given in
Appendix A.
1.3 RECOMMENDATION: SPEECH SYNTHESIS


We recommend that the Coast Guard research
implementation of speech synthesis technology at this time
for: 1) specific tasks, particularly weather reports, and
2) general purpose use in communication stations.

RATIONALE: 

A  speech  synthesis  system  could  be  assembled  from  available 
products  to  automate  weather  reports  specifically  and  to  be 
used  generally  in  other  routine  transmissions  within 
communication stations. The speech quality will be
consistent, without regional accent, and the synthesis could
provide  for  an  extensive  vocabulary.  Since  there  is  a 
tradeoff  between  quality  and  vocabulary  size,  it  is 
understood  that  the  pronunciation  will  be  of  less  than 
broadcast  standard  and  have  some  "machine  quality",  but  will 
be  most  adequate  for  Coast  Guard  needs.  The  system  can  be 
made  both  "user  proof"  and  "user  friendly",  allowing 
operation  by  personnel  with  little  technical  training. 
Synthesized  speech  could  be  generated  instantly  from  reports 
coming  over  the  teletype  with  virtually  no  manpower 
requirements.  Such  a  system  could  be  integrated  into  future 
general  automation. 

1.4  TECHNICAL  BACKGROUND:  SPEECH  RECOGNITION 

Machines differ widely in the sophistication with which
they can "recognize" or "understand" human speech. The
simplest systems are capable of responding only to an
exceptionally limited vocabulary. For example, some can
understand only the digits zero through nine, plus a very
few selected vocabulary items. These words must be spoken
one at a time, by one person only, and the machine will
respond correctly to that person's voice only after
"training", where the person says each word over and over to
give the machine some idea of what to expect. These systems
are said to be "isolated word recognizers" because each word
must be spoken with pauses of silence as boundaries. They
are not able to handle "connected or continuous speech".
These systems are also said to be "speaker dependent"
because they must be trained by each speaker before they are
able to recognize the words correctly.
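
The dependence of an isolated word recognizer on pauses can be illustrated with a short sketch. The Python fragment below is an illustration only and does not describe any reviewed device. It assumes the input has already been reduced to one energy value per short time frame; the names split_isolated_words, silence_threshold, and min_pause_frames are hypothetical.

```python
def split_isolated_words(frame_energies, silence_threshold=0.1, min_pause_frames=3):
    """Return (start, end) frame ranges, end exclusive, for each word-like segment.

    A segment is closed only after `min_pause_frames` consecutive frames
    fall below `silence_threshold`, which mimics the pause an isolated
    word recognizer needs between words.
    """
    segments = []
    start = None      # first loud frame of the word being collected
    quiet_run = 0     # consecutive quiet frames seen since the last loud one
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_pause_frames:
                # The word ended just before the run of quiet frames.
                segments.append((start, i - quiet_run + 1))
                start = None
                quiet_run = 0
    if start is not None:
        # Input ended before a full pause was observed.
        segments.append((start, len(frame_energies) - quiet_run))
    return segments


# Two words separated by a long pause are found separately, but the
# one-frame gap at index 10 is too short, so the sounds on either side
# of it are merged into a single segment.
energies = [0, 0, 0.5, 0.6, 0.4, 0, 0, 0, 0.7, 0.8, 0, 0.9, 0, 0, 0, 0]
print(split_isolated_words(energies))
```

When the speaker does not pause long enough, the word boundaries disappear, which is one reason such a device cannot handle connected speech.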

Improvements over these simplest systems are of two
separate types: first, some machines can identify the
vocabulary items in their list without having to be trained;
second, some machines can receive connected or continuous
speech. The advantage of the systems which have the first
improvement is that anyone can communicate with the machine
immediately without having to train it. Such systems are
called "speaker independent", because the machine does not
need information about which individual is addressing it,
and, therefore, can respond independently of such
information. Almost all of these systems still require that
the words be presented one at a time.

The  second  kind  of  improvement  has  resulted  in  machines 
which  have  fairly  extensive  vocabularies  that  can  be 
presented  to  them  in  a  relatively  normal,  continuous  manner. 
These machines can receive spoken input as ordinary
sentences, rather than one word at a time. They are referred
to as machines capable of handling "connected or continuous
speech". They still have to be "trained" to be able to
understand any individual who is going to communicate with
them.

From  the  point  of  view  of  virtually  all  applications, 
it  is  unfortunate  that  no  one  system  yet  incorporates  both 
improvements,  thereby  becoming  both  speaker  independent  and 
capable of handling connected speech. Private industry,
which  sees  a  major  market  for  improved  speech  recognition 
systems,  is  attempting  to  solve  the  problems  involved  in 
producing  a  widely  useful  recognition  unit. 

A large part of the difficulty with producing a better
system  is  purely  technical.  That  is,  the  basic  approaches 
used  so  far  seem  likely  to  continue  to  be  fruitful,  but  they 
need  refinement.  The  heart  of  the  problem  lies  in  the  fact 
that  no  machine  actually  "understands"  anything  in  the 
intuitive  and  rational  way  a  human  being  does.  The  machine 
does  not  "recognize"  anything,  either,  in  the  way  humans  do. 
What machines actually do is follow a set of instructions
which are individually very simple, yet very numerous and
assembled in complex ways. We give an oral command and say
that the machine "understands" it. But what we mean is only
that the action produced by the last instruction is the
action a human would take upon hearing the same original
oral command. To get the machine to behave correctly, all
the intermediate instructions between the command and the
action must be correct. It is already a complex task to
assemble instructions that are known to be useful. In
addition, to produce improved speech recognizers, new types
of instruction must be developed and integrated into the
systems.

One  of  the  central  issues  in  speech  recognition  lies  in 
the area of "template matching". One way for a machine to
decide whether a word that it "hears" is the same as a word
that it "knows" is to make a comparison between a stored
"template" and the incoming word that has just been spoken.
Humans do this intuitively without being able to explain how
they do it. We know exactly how a machine does it, because
we must give it explicit instructions on what features to
consider. When it identifies an acoustic feature in both the
template and the input signal, it must be told how much of
that feature must be similar, if not identical, to the
template to count as "the same". When it has identified all
the features as "same" or "different", it must be told what
proportion of all features must score as "same" to result in
the whole word being considered to be an example of the
"same word".

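The two decisions described above (how close a feature must be, and what proportion of features must agree) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any particular recognizer: each word is assumed to be already reduced to a fixed-length list of numeric features, and the names features_match, is_same_word, tolerance, and required_fraction are hypothetical.

```python
def features_match(template_value, input_value, tolerance):
    """A feature counts as "same" if it is within a relative tolerance of the template."""
    return abs(template_value - input_value) <= tolerance * abs(template_value)


def is_same_word(template, observed, tolerance=0.15, required_fraction=0.8):
    """Decide "same word" when enough individual features score as "same"."""
    if not template or len(template) != len(observed):
        raise ValueError("template and input need the same, non-zero number of features")
    same = sum(features_match(t, o, tolerance) for t, o in zip(template, observed))
    return same / len(template) >= required_fraction


template = [700.0, 1200.0, 2500.0, 300.0, 90.0]   # made-up feature values
close = [720.0, 1150.0, 2450.0, 310.0, 120.0]     # four of five features agree
far = [400.0, 1900.0, 2500.0, 300.0, 90.0]        # only three of five agree

print(is_same_word(template, close))
print(is_same_word(template, far))
```

Both thresholds are design choices: loosening them admits more speaker variation but also more false matches.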
No  two  humans  speak  exactly  like  one  another,  and  no 
one person always says the same word in the same way every
time. It is difficult to give a machine instructions that
allow for sufficient flexibility and at the same time
preserve the essential patterns. Any number of small
variations that humans handle with ease can generate wrong
decisions in a machine. A word spoken more rapidly or more
slowly than the template word may match at the beginning,
for the first phoneme. By the time the input word arrives at
the fifth phoneme, a simple machine may be trying to compare
that fifth phoneme of the input with the fourth phoneme of
the template. Such a comparison would result in an erroneous
judgment of "different". Techniques such as time warping
attempt to ensure that the machine will stretch or shrink
the input to fit the template, but they have not yet reached
the level of sophistication needed to assure correct matches
in every case.
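
The misalignment problem and the time warping remedy can be illustrated with the textbook dynamic time warping recurrence. The sketch below is an assumption-laden illustration, not a description of the warping used in any commercial unit: each word is reduced to a list of single numbers (one per frame), and the function names rigid_distance and dtw_distance are hypothetical.

```python
def rigid_distance(template, observed):
    """Frame-by-frame comparison with no allowance for speaking rate."""
    return sum(abs(t - o) for t, o in zip(template, observed))


def dtw_distance(template, observed):
    """Dynamic time warping: the input may be stretched or shrunk to fit the template."""
    n, m = len(template), len(observed)
    inf = float("inf")
    # cost[i][j] = best total mismatch aligning the first i template frames
    # with the first j input frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = abs(template[i - 1] - observed[j - 1])
            cost[i][j] = mismatch + min(cost[i - 1][j],      # template frame repeated
                                        cost[i][j - 1],      # input frame repeated
                                        cost[i - 1][j - 1])  # both advance
    return cost[n][m]


template = [1, 2, 3, 4, 5]
spoken_slowly = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # same word at half speed

print(rigid_distance(template, spoken_slowly))   # large: frames drift out of step
print(dtw_distance(template, spoken_slowly))     # zero: warping realigns them
```

The rigid comparison drifts out of step exactly as described for the fourth and fifth phonemes, while the warped comparison absorbs the difference in speaking rate.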

Analogous  problems  exist  for  input  words  produced  at 
levels  of  loudness  or  stress  different  from  the  template,  or 
words  spoken  with  regional  accents  different  from  that  of 
the  person  who  produced  the  words  on  which  the  template  is 
based.  For  these  problems  partial  solutions  exist,  each 
roughly  as  successful  as  time  warping,  but  none  guaranteeing 
totally  correct  recognition. 

A  different  kind  of  problem  is  presented  by  noise. 
Humans  have  a  capacity  for  "selective  attention"  by  which 
they  automatically  pay  attention  to  the  speech  sounds  and 
ignore  any  random  hiss,  crackle,  bang,  or  other  non-speech 
sound.  As  far  as  a  machine  is  concerned,  any  sound  that 
enters  the  system  is  as  important  as  any  other.  In  order  to 
prevent the machine from trying to match irrelevant noises
with features in the template, a way must be found to
separate the noise from the desired signal. Methods
developed to date are successful only with low levels of
noise, although improvement is being made.

In  short,  machines  are  extremely  different  from  people, 
and  in  performing  tasks  of  speech  recognition,  far  less 
competent. 

Of  the  approximately  19  speech  recognition  units 
reviewed,  not  one  is  of  a  level  of  sophistication  to  meet 
minimum Coast Guard requirements. The rapid progress of
basic research in speech recognition, however, makes it
appear likely that suitable units will be available for
purchase in a few years, but it is not recommended that the
Coast Guard implement the technology at this time.

1.5 TECHNICAL BACKGROUND: SPEECH SYNTHESIS

Speech  synthesis  is  the  science  of  producing  human 
speech  by  artificial  means,  usually  by  performing  various 
operations  on  stored  material.  The  stored  data  base  may  or 
may  not  ultimately  derive  from  recorded  human  voices. 
Systems  that  derive  from  recordings  have  the  advantage  of 
sounding  fairly  natural,  but  they  have  the  disadvantage  of  a 
limited  vocabulary  for  a  given  set  of  messages  and  of  high 
cost  due  to  large  storage  requirements.  One  of  the  most 
successful  of  such  systems  is  generally  known  as  LPC 
synthesis.  (LPC  stands  for  "linear  prediction  coding",  and 
refers  to  the  method  by  which  the  computer  selects  the 
material  it  extracts  from  the  original  speech  for  storage, 
from  which  it  will  produce  an  imitation  of  the  human  voice. 
LPC synthesis uses a digital filter to model the human vocal
tract. It is based upon the statistical assumption that
human  speech  changes  relatively  slowly,  and  that  it  is 
possible  to  predict  the  next  set  of  acoustic  measures  based 
on  a  knowledge  of  previous  ones.) 
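
The idea of predicting the next measurement from previous ones can be shown with a deliberately small example. The sketch below fits a two-term predictor, x[n] ≈ a1·x[n-1] + a2·x[n-2], by least squares and then predicts the next sample. It is an illustration of linear prediction only: practical LPC synthesizers use more coefficients and frame-by-frame analysis, and the function names and the test tone here are assumptions, not taken from any reviewed product.

```python
import math


def fit_predictor_order2(x):
    """Least-squares fit of x[n] ~ a1*x[n-1] + a2*x[n-2] over the whole signal."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for n in range(2, len(x)):
        p1, p2 = x[n - 1], x[n - 2]
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        b1 += x[n] * p1
        b2 += x[n] * p2
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("signal does not determine a unique predictor")
    # Solve the 2x2 normal equations by Cramer's rule.
    a1 = (b1 * s22 - b2 * s12) / det
    a2 = (s11 * b2 - s12 * b1) / det
    return a1, a2


def predict_next(x, a1, a2):
    """Predict the sample that follows the end of x."""
    return a1 * x[-1] + a2 * x[-2]


# A steady tone is perfectly predictable from its two previous samples.
omega = 0.3
tone = [math.sin(omega * n) for n in range(50)]
a1, a2 = fit_predictor_order2(tone)
print(a1, a2)
print(predict_next(tone, a1, a2), math.sin(omega * 50))
```

Because only the few predictor coefficients need to be kept for each stretch of speech, rather than every sample, the stored description is far more compact than the original recording.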

The  other  major  class  of  synthesizers  falls  into  the 
category of rule synthesizers (also called text-to-speech
synthesizers  or  phoneme  synthesizers)  which  are  not  based  on 
recorded  human  speech.  These  synthesizers  store  a  formula 
for  the  components  that  represent  the  sounds  of  the  letters 
in  ordinary  spelling.  Such  sound  components  are  called 
phonemes.  Much  as  a  group  of  letters  are  assembled  to  form  a 
written  word,  so  a  related  set  of  phonemes  are  assembled  to 
form  a  spoken  word.  The  name  "rule  synthesizer"  emphasizes 
one  aspect  of  such  word-building,  the  fact  that  there  are 
general patterns in the English language which can be
described  in  the  form  of  a  set  of  rules  for  the  computer  to 
apply. 

One  set  of  rules  involves  English  spelling,  which  is 
often notoriously non-phonetic. For example, "so" and "do"
have the same letter at the end, but they are not
phonetically  pronounced  the  same.  The  rules  help  the 
computer  obtain  basic  pronunciations  for  most  words  which 
follow  general  spelling  rules.  Words  such  as  "knowledge"  or 
"freight",  where  the  pronunciation  would  be  absurd  if  all 
the  letters  were  pronounced,  are  treated  separately.  The 
other  sets  of  rules,  and  much  the  more  technically  demanding 
to develop, are the ones that refine the pronunciation from
the  first  attempt  to  something  more  acceptable  to  the  human 
listener. 
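
The combination of general spelling rules with separately treated exception words can be sketched as follows. The rule table, the exception list, and the phoneme symbols (loosely ARPAbet-style) are all hypothetical and far smaller than anything a working synthesizer would need; the fragment only illustrates the lookup order.

```python
# Words whose pronunciation the general rules would get wrong (illustrative entries).
EXCEPTIONS = {
    "do": ["D", "UW"],
    "knowledge": ["N", "AA", "L", "IH", "JH"],
    "freight": ["F", "R", "EY", "T"],
}

# General spelling-to-sound rules; multi-letter spellings are listed first
# so that they are tried before single letters.
RULES = [
    ("sh", ["SH"]), ("th", ["TH"]), ("ee", ["IY"]),
    ("a", ["AE"]), ("b", ["B"]), ("d", ["D"]), ("e", ["EH"]),
    ("g", ["G"]), ("i", ["IH"]), ("n", ["N"]), ("o", ["OW"]),
    ("p", ["P"]), ("s", ["S"]), ("t", ["T"]),
]


def to_phonemes(word):
    """Return a phoneme list: exception lookup first, then left-to-right rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes = []
    i = 0
    while i < len(word):
        for spelling, sounds in RULES:
            if word.startswith(spelling, i):
                phonemes.extend(sounds)
                i += len(spelling)
                break
        else:
            raise ValueError(f"no rule covers {word[i]!r} in {word!r}")
    return phonemes


print(to_phonemes("so"))       # handled by the general rules
print(to_phonemes("do"))       # same final letter, handled as an exception
print(to_phonemes("freight"))  # exception word
```

The output of this first pass is only a basic pronunciation; the refinement rules discussed next operate on it.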

There  are  two  general  areas  in  which  refinement  is 
needed.  One  is  that  when  the  units  of  sound  (individual 
phonemes) are first assembled, they may not blend smoothly
together  and  form  a  recognizable  word,  even  though  the 
correct  sounds  are  present  in  the  right  order.  What  is 
needed  is  a  set  of  rules  linking  each  phoneme  to  the  next 
with  suitable  transitions  to  smooth  the  pronunciation  and 
unify  the  sound  of  the  word.  The  other  area  where  rules  are 
needed  is  to  assemble  the  words,  which  may  individually  be 
acceptable,  into  a  sentence  which  flows  smoothly  in  patterns 
of  rhythm  and  emphasis  expected  by  a  normal  human  listener. 
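
One very simple way to picture a transition rule is to let a sound parameter glide from one phoneme's target value to the next instead of jumping. The sketch below uses straight-line interpolation of a single parameter; this is an assumption chosen for brevity, since real rule synthesizers use more elaborate, context-dependent transitions, and the function name and values are hypothetical.

```python
def smooth_transitions(targets, steps_per_phoneme=4):
    """Turn one target value per phoneme into a frame-by-frame track.

    Between consecutive targets the value moves in equal steps, so the
    parameter glides from one phoneme to the next rather than jumping.
    """
    if not targets:
        return []
    track = []
    for current, following in zip(targets, targets[1:]):
        for step in range(steps_per_phoneme):
            fraction = step / steps_per_phoneme
            track.append(current + (following - current) * fraction)
    track.append(targets[-1])
    return track


# Made-up resonance targets (in Hz) for two successive phonemes.
print(smooth_transitions([300.0, 700.0]))
```

Without such a rule the track would hold 300 Hz and then jump abruptly to 700 Hz, which is the kind of discontinuity that keeps correct sounds from blending into a recognizable word.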

Any  sentence,  however  reasonable,  becomes  suddenly 
difficult  to  understand  if  the  individual  words  are 
separated  and  said  one  at  a  time,  as  if  in  a  list.  What  a 
list  lacks  are  the  subtleties  of  emphasis  that  let  the 
listener  know  which  words  are  minor  parts  of  the  utterance. 
This general pattern or "melody" of the sentence is usually
referred to as the intonation. Developing rules for
natural-sounding intonation is probably the most difficult part of
text-to-speech or rule synthesis techniques, and the area in
which a listener is most likely to find fault with the
"machine quality" of the speech.

The advantages of rule synthesis are numerous. The
storage requirements are small, making that aspect
inexpensive. The vocabulary can be unlimited. Any text can
be converted to speech automatically, e.g., urgent messages
from the Coast Guard Communications Stations to the marine
community can become speech instantly. Some pronunciations
will probably be incorrect, but can be improved with
respelling of the text.

Hybrid  systems  also  exist,  as  do  linguistically 
sophisticated  systems  requiring  extensive  specialized 
knowledge  for  their  operation.  These  systems  will  not  be 
discussed  in  detail  in  this  report  for  they  do  not  appear  to 
meet Coast Guard requirements.

In  summary,  state-of-the-art  speech  synthesis  best 
lends  itself  to  adaptation  to  specific  Coast  Guard  needs, 
such  as  broadcasting  weather  reports,  and  to  general  use  in 
certain  applications  within  communication  stations. 

Approximately  32  synthesizers  of  different  types 
available  for  purchase  are  reviewed.  Of  these,  perhaps  two 
or  three  are  potentially  adaptable  to  Coast  Guard  needs, 
although  none  is  an  exact  match  to  Coast  Guard 
specifications.  Considerable  work  would  be  required  to 
achieve  the  best  possible  mix  of  such  factors  as  naturalness 
and  intelligibility  of  speech,  ease  of  operation,  limitation 
of cost, and potential for integration into future
large-scale communications automation. Ultimately, however, the
use of synthesis for certain Coast Guard tasks, such as
weather reports, would be advantageous and is recommended
for consideration.


CHAPTER  2 


REVIEW OF COAST GUARD'S STATEMENT OF WORK

This  report  discusses  two  advanced  technologies,  speech 
recognition  and  synthesis,  for  possible  use  at  Coast  Guard 
Communications  Stations  and  Radio  Stations.  SCRL  was  asked 
to  consider  that  speech  recognition  technology  be  used  as  an 
aid  to  watch  standers  for  spotting  distress  calls  over  the 
marine VHF-FM, HF (voice), and MF (voice) frequencies. In
particular,  the  Coast  Guard  has  expressed  an  interest  in 
keyword  spotting  for  incoming  broadcast  messages  (for 
example,  automatically  recognizing  "mayday",  "fire", 
"sinking",  etc.).  Such  keyword  spotting  would  be  a  means  of 
reducing  the  error  rate  in  monitoring  distress  frequencies. 


SCRL also considered that speech synthesis techniques be
used for automatic broadcasting of stored text messages,
such as weather information, notices to mariners,
hydrographic information, storm warnings, advisories, safety
messages, and urgent messages.

2.1  SPEECH  RECOGNITION  TECHNOLOGY  AND  COAST  GUARD  PLANNING 

Two Coast Guard publications were furnished to SCRL for
evaluation in this area: 1) Telecommunications Manual
(COMDTINST M2000.3A), and 2) Coast Guard Radio Frequency
Plan (COMDTINST M2400.1A).
The Coast Guard's Statement of Work covering this
project further noted that its main area of interest in
speech recognition technology involved keyword spotting in
connected or continuous speech. It also stated that the
Coast Guard was aware of problems involved with applying
speech recognition technology to Coast Guard needs in this
area. These included distortion and noise in incoming
signals due to radio transmission and problems in
recognition due to coarticulations in different phonetic
contexts.

Perhaps the most serious problem is that the Coast
Guard obviously requires a completely speaker independent
recognizer. Currently available speaker independent
recognizers only operate reliably with digits, that is,
words such as one, two, three, four, etc. Recognition of
words other than digits requires training the recognizer for
individual speakers' vocabularies. Usually this is
accomplished by the recognizer system prompting the user,
who speaks the desired vocabulary items so they can be used
as templates for matching with incoming words. Several
training passes (repetitions) are usually required in order
to "train" the speaker dependent recognizer. It is not
feasible to obtain training material from all speakers of
messages which are received by the Coast Guard. Thus, the
speech recognition technology must be speaker independent to
be useful for Coast Guard application.
There is no currently available speech recognizer that
would meet the Coast Guard's requirements for a speaker
independent  recognizer  capable  of  keyword  spotting  for 
incoming  distress  signals.  Several  companies  are  working 
diligently  in  this  area,  but  it  should  be  at  least  several 
more  years  before  any  manufacturer  is  able  to  market  such  a 
recognition  system. 

A  second  requirement  for  the  spotting  of  keywords  in 
distress  signals  concerns  the  need  for  a  recognizer  that  can 
handle connected or continuous speech. Almost all voice
recognizers  are  isolated  word  recognizers  which  require 
pauses  between  words  (or  simple  phrases  which  are  treated  as 
words)  so  that  problems  of  coarticulations  in  different 
phonetic  contexts  will  be  avoided.  Recently,  however, 
certain companies have marketed recognizers which are
capable of handling connected speech up to 180 words per
minute. Unfortunately, these connected or continuous speech
recognizers are not speaker independent. So, they do not
meet both needs of the Coast Guard.

Another major technical problem with spotting keywords
in Coast Guard distress signals would be that such signals
have a relatively low signal-to-noise ratio. SCRL's
analysis of Coast Guard signals revealed a signal-to-noise
ratio which ranged from approximately 13 dB up to
approximately 30 dB, with an average of 23 dB. Such a low
signal-to-noise ratio creates a real problem for currently
available recognizers. While it is true that recognizers
will generally operate with a relatively high degree of
background noise, it should also be noted that this
background noise must be of a periodic nature or it will
create false recognitions. Incoming signals, with their
pops, clicks, and other nonperiodic sounds, would present
problems in this area. Also, incoming distress signals
received by the Coast Guard involve different speaker rates,
dialects, and accents which include phonetic and prosodic
variability. These areas all involve problems for speech
recognition techniques and should be better resolved before
the Coast Guard selects any form of speech recognition
technology.
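
To put the decibel figures above into more familiar terms, a signal-to-noise ratio in dB is ten times the base-10 logarithm of the ratio of signal power to noise power. The short sketch below converts in both directions; the function names are hypothetical and the only inputs are the three figures quoted in the text.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from two power measurements."""
    return 10.0 * math.log10(signal_power / noise_power)


def power_ratio(snr_in_db):
    """How many times stronger the signal power is than the noise power."""
    return 10.0 ** (snr_in_db / 10.0)


for figure in (13, 23, 30):
    print(figure, "dB corresponds to a power ratio of about", round(power_ratio(figure)))
```

On this scale the worst measured signals carry roughly 20 times more speech power than noise power, the average roughly 200 times, and the best 1000 times.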

2.2 SPEECH SYNTHESIS TECHNOLOGY AND COAST GUARD PLANNING

The  Coast  Guard  may  be  interested  in  speech  synthesis 
technology  as  it  relates  to  automatic  broadcasting  of  stored 
text  messages,  such  as  weather  information,  notices  to 
mariners, hydrographic information, storm warnings,
advisories,  safety  messages,  and  urgent  messages. 

One  important  Coast  Guard  consideration  relating  to  the 
potential  use  of  speech  synthesis  technology  is  that  it 
would help to ease manpower requirements for the production
of required Coast Guard broadcasts. Synthesized utterances
can be readily obtained for transmission. Synthesized
utterances should approach those of a trained broadcaster in
overall quality. Synthesized broadcasts would have no
variances in voice characteristics due to changes in the
distance between the microphone and the mouth, or due to
regional dialect differences. It might be added, however,
that if changes in phonetic or prosodic codings are desired
to produce different alertness responses, these variations
can be generated by voice synthesis. Also any text can be
converted to speech automatically, e.g., incoming weather
reports coming over a teletype can become speech instantly.
The Coast Guard also feels that digital storage techniques
are, of course, more reliable and easier to maintain and
edit than analog tape recordings or magnetic drums.

Currently  the  Coast  Guard  uses  analog  speech  recordings 
which  have  good  quality,  but  they  present  the  following 
problems:  a)  vocabularies  are  limited  in  size,  b)  the 
recordings are difficult to modify, c) the recordings are
difficult to operate automatically, and d) the recordings
produce  a  discontinuous  dialog  when  spliced  together. 

Some of the above problems might be eliminated by the
use of speech synthesis. This technology produces sounds
associated with basic units of speech (phonemes) which are
combined to make words. Electronic logic reads stored text,
assembles phonemes into words or sentences in the proper
sequence and outputs the desired synthesized utterances.
Prosodic characteristics (such as stress, pitch, and
duration) can be modified as required to produce the desired
pronunciation characteristics of synthesized utterances.


The  desire  to  have  utterances  of  an  unlimited  vocabulary  and 
simultaneously  of  good  quality  presents  a  challenge  in  that 
a  tradeoff  exists  between  vocabulary  size  and  quality. 
Quite  obviously,  a  very  limited  number  of  vocabulary  items 
(such  as  the  digits)  can  be  carefully  synthesized  to  obtain 
very  good  quality.  On  the  other  hand,  it  is  difficult  to 
synthesize  an  exceedingly  large  vocabulary  (such  as  10,000 
words)  with  an  equivalently  good  quality.  The  type  of 
synthesis  used  in  each  case  would  be  different,  as  discussed 
below. 

2.2.1  TYPES  OF  SPEECH  SYNTHESIZERS.  It  is  important  that 
information  be  provided  to  the  Coast  Guard  regarding  the 
possible  implementation  of  speech  synthesis  to  meet  its 
broadcast  requirements.  The  various  types  of  synthesizers 
available  will  be  briefly  detailed  here,  since  different 
types  of  speech  synthesizers  have  specific  advantages  and 
disadvantages  as  they  relate  to  Coast  Guard  needs. 
Basically,  there  are  three  main  types  of  speech  synthesis: 

1)  Analysis  synthesis  or  LPC  synthesis  -  such 
synthesizers  typically  rely  upon  stored  linear 
prediction  coefficients  which  are  used  to  define  a 
digital  filter  which  simulates  the  human  vocal  tract. 
Such  synthesizers  typically  exhibit  high-quality  speech 
output,  with  realistic  prosodic  features,  such  as 
stress,  intonation,  etc.  LPC  synthesizers  are  not 
generally  geared  to  the  production  of  specific 
phonemes, but are set to output whole words based upon
real  human  speech  which  has  been  analyzed.  A 
limitation  of  analysis  synthesis  synthesizers  is  that 
they  typically  require  large  amounts  of  computer 
storage  for  individual  words,  since  words  are  stored  as 
complete  units. 

2) Rule synthesis - such synthesizers do not rely upon
an actual analysis of speech as a basis for output of
synthesized utterances. Instead, rule synthesizers use
combinations of different parameters which are designed
to simulate actual speech. Rule synthesizers generate
combinations of basic phonemes, so they typically
exhibit large vocabularies. There are certain
limitations to rule synthesizers. Their output speech
is typically not of the same quality as LPC synthesis.
Rule synthesizers have difficulty with prosodic features
such as stress, intonation, etc. They also encounter
problems with coarticulations, since different phoneme
combinations have different coarticulations.

3)  Digital  recordings  for  synthesis  -  devices  of  this 
type  are  not  speech  synthesizers  in  the  literal  sense 
of  the  term.  The  approach  is  to  digitally  record  human 
speech which is typically stored on LSI chips for
subsequent playback. One advantage to digital
recordings is that they exhibit high quality audio
output, since they consist of actual "recordings" of
human speech. On the other hand, such an approach does
not  allow  one  to  synthesize  novel  utterances,  or  to 
combine  phonemes  to  achieve  a  very  large  output 
vocabulary. 

SCRL notes that several manufacturers of synthesizers
now market, or are planning to market, text-to-speech
systems which are advanced rule synthesizers. These are
designed to allow the user to type in a sequence of words at
a computer terminal, which are subsequently output as whole
spoken sentences. Such text-to-speech synthesizers
generally will include not only segmental (phoneme or
letter) encoding, but also suprasegmental (prosodic)
information, such as appropriate stress levels and
intonation patterns, to improve their output. This is a
definite advantage where the user wants to output whole
sentences or phrases with natural-sounding prosodic
patterns.

2.2.2 COAST GUARD CONSIDERATIONS. For all types of
synthesizers, the size of the broadcast vocabulary is a
primary consideration. The Coast Guard appears to require a
synthesizer with a very large, if not unlimited, vocabulary.
Rule synthesizers have a definite advantage in this area. A
major consideration regarding the use of voice synthesis by
the Coast Guard involves the degree to which vocabulary
changes would have to be made. It appears that the Coast
Guard would require very frequent changes in their broadcast
vocabulary. Distress, safety, and urgent messages might
require such changes. As noted, the rule synthesizers do
have a very definite advantage in this area, as they can
string phonemes together to create new words without
difficulty. On the other hand, analysis synthesis typically
does  not  include  this  capability,  but  does  allow  the  user  to 
string  different  combinations  of  words  together,  often 
within  the  context  of  some  basic  sentence  to  preserve 
natural  sentence  intonation.  Weather  broadcasts  might  be 
synthesized  using  this  technique  until  more  natural  sounding 
speech  is  generated  by  rule  synthesis,  assuming  the  main 
vocabulary  is  relatively  fixed.  New  items,  such  as  names  of 
storms,  could  be  added  to  the  fixed  vocabulary  as  needed. 
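
The word-stringing approach described above, in which stored words are slotted into a basic sentence, can be sketched as follows. The carrier sentence, the vocabulary, and the function name are all hypothetical; the fragment only shows that a message can be assembled when every word is already stored, and that a new word must first be added to the fixed vocabulary.

```python
# Words assumed to be already stored as analyzed speech (illustrative list).
FIXED_VOCABULARY = {
    "winds", "northwest", "southeast", "at", "ten", "fifteen",
    "twenty", "knots", "seas", "two", "four", "feet",
}

# A basic sentence whose overall intonation is fixed; only the slots change.
CARRIER = "winds {direction} at {speed} knots seas {height} feet"


def assemble_broadcast(direction, speed, height, vocabulary=FIXED_VOCABULARY):
    """Return the ordered list of stored words to play back for one report."""
    words = CARRIER.format(direction=direction, speed=speed, height=height).split()
    missing = [w for w in words if w not in vocabulary]
    if missing:
        # A word-based system cannot speak what has not been stored.
        raise KeyError(f"not in stored vocabulary: {missing}")
    return words


print(assemble_broadcast("northwest", "fifteen", "four"))
```

A rule synthesizer would not need the vocabulary check at all, which is the advantage noted above; the price is less natural-sounding speech.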

Another  very  important  consideration  involves  the 
quality  of  broadcast  messages.  It  can  be  assumed  that  the 
Coast  Guard  requires  high-quality  speech  output,  with  good 
prosodic  characteristics  and  no  serious  audio  degradation  of 
broadcast messages due to difficulties with coarticulations
between  phonemes.  Note  that  analysis  synthesis,  such  as 
provided  in  LPC  synthesizers,  does  exhibit  high-quality 
phonetic  and  prosodic  characteristics,  since  it  simulates 
actual  human  performance  in  this  area.  Rule  synthesizers 
typically  include  at  least  several  levels  of  stress,  which 
must  be  manipulated  by  the  user  to  ensure  realistic-sounding 
output.  It  is  important  that  the  Coast  Guard  have  the 
opportunity  to  evaluate  the  acceptability  of  output  from 
different types of speech synthesizers for its use.

2.2.3 LEVELS OF SYNTHESIS PRODUCTS. It should be noted that
there are several levels of synthesis products which are
commercially available, just as there are several types of
commercial synthesizers. These products are detailed to
help the Coast Guard become more familiar with the various
options regarding implementation of voice synthesis
techniques. There are basically three levels of synthesis
products available. First, there are the complete systems,
which come with a host computer and which require minimal
software.  Second,  there  are  the  board-level  products,  which 
are  designed  to  plug  into  a  host  computer.  Finally,  there 
are  the  LSI  chip  level  products  which  must  be  integrated 
into  circuit  boards  before  they  can  be  used. 

There are several synthesizers which come with a host
computer and all relevant software. The main advantage to
such synthesizer systems is that they require virtually no
installation or host software. For example, Centigram
markets  a  speech  development  system,  complete  with  a 
digitizer,  host  computer,  disk,  and  a  parametric  waveform 
synthesizer.  The  main  advantage  to  such  a  system  is  that  it 
requires  no  software  for  integration  with  a  host  computer. 
Such  a  system  is  particularly  applicable  where  users  might 
have  a  need  for  an  additional  host  computer  or  do  not  yet 
have  a  computer  system. 

Board-level speech synthesizers are generally designed
to plug into RS-232-C interfaces, the most common type of
computer interface. Such synthesizers are available in a
wide variety of configurations: analysis synthesis types,
rule synthesis types, and digital recording types. These do
require that the user have a host computer to control the
synthesizer and, in many cases, to store additional
vocabulary which is to be synthesized. In most cases, the
necessary software is supplied by the manufacturer. Note
that board-level synthesizers are generally quite reasonable
in price (from approximately $500 to $3,000), depending upon
the synthesizer configuration desired.

Finally,  speech  synthesizers  are  available  as  LSI  chip 
level  products,  which  have  to  be  integrated  into  circuit 
boards  before  they  can  be  used.  Actually,  such  chip  level 
synthesizers  are  generally  sold  to  original  equipment 
manufacturers (O.E.M.) for use in consumer products, home
computers,  etc.  Chip  level  synthesizers  are  also  available 
in  a  wide  variety  of  configurations.  Integrating  chip  level 
synthesizers  into  a  large-scale  speech  synthesis  strategy 
requires  a  relatively  high  degree  of  engineering  and 
electronics  sophistication  on  the  part  of  the  user. 
Naturally,  along  with  actual  LSI  speech  synthesis  chips,  the 
user  must  include  additional  chips  for  storage  of 
vocabulary,  clock  timer  circuitry,  etc.  All  in  all,  this 
would  have  to  be  considered  the  most  complex  approach  the 
Coast Guard could take with regard to speech synthesis, and
one that should be approached with caution. This is
particularly true in that manufacturers typically do not
provide necessary controller software with their chip
level products.

One  point  which  should  be  made  is  that  most 
manufacturers  of  speech  synthesis  products  are  very  willing 
to  work  with  users  in  the  actual  setup  of  their 
synthesizers. In many cases, too, manufacturers offer
custom boards designed for quite specific purposes. Should
the Coast Guard purchase a speech synthesizer system, it can
be assumed that manufacturers would be willing to work
closely with it to get the system operational. Also,
manufacturers typically offer custom vocabularies for their
board level products, to suit quite specific needs. This is
particularly true for analysis synthesis (LPC) synthesizers
which  generally  are  not  designed  to  string  phonemes  together 
to  create  new  vocabulary  items.  The  Coast  Guard  should  be 
aware  that  custom  synthesizers  do  require  considerable  lag 
time  for  the  manufacturer  to  prepare  LSI  chips  with  desired 
vocabulary  items  and  to  integrate  these  into  actual  circuit 
boards.  Changes  in  the  vocabulary  of  analysis  synthesis 
type  synthesizers  typically  require  installation  of 
additional  LSI  chips  containing  the  proper  vocabulary. 

2.2.4  ASSUMPTIONS  REGARDING  COAST  GUARD  SPEECH  SYNTHESIS 
NEEDS.  Based  upon  general  considerations  regarding  the 
actual  use  of  speech  synthesizers,  SCRL  is  able  to  make 
several important assumptions regarding the potential use of
speech synthesizers by the Coast Guard. These assumptions
should  help  clarify  the  type  of  synthesizer  the  Coast  Guard 
might  wish  to  use  for  meeting  its  broadcast  requirements. 

1)  The  use  of  speech  synthesis  strategy  would  avoid 
several  of  the  problems  the  Coast  Guard  now  faces  in 
meeting  its  broadcast  requirements.  For  example, 
speech  synthesizers  do  not  typically  require  the  use  of 
soundproof  booths  as  do  current  Coast  Guard  broadcasts. 
Synthesizers  also  avoid  the  problem  of  speakers 
enunciating  broadcasts  at  different  repetition  rates, 
or  with  different  dialects.  Speech  synthesizers  also 
avoid  problems  with  varying  distances  between  the  mouth 
and microphone, inherent in analog-type recordings for
broadcast.

2)  The  Coast  Guard  apparently  requires  an  essentially 
unlimited  vocabulary  for  its  total  broadcast 
requirements.  This  assumption  is  based  upon  the  fact 
that  broadcasts  would  have  to  name  ships,  storms,  etc. 
If  broadcasts  were  based  upon  a  stored  set  of 
vocabulary  items,  it  seems  most  likely  that  this 
vocabulary would have to undergo very frequent changes.
This  all  argues  for  a  rule  synthesizer,  which  would 
have  the  capability  to  concatenate  phonemes  to  create 
new  vocabulary  items  as  desired. 

3)  It  can  be  assumed  that  the  Coast  Guard  would  require 
very high-quality speech synthesis, since broadcast
messages are transmitted over radio channels which
would  subject  them  to  further  acoustic  degradation. 
Towards  this  end,  the  Coast  Guard  should  have  the 
opportunity  to  evaluate  the  output  from  various  types 
of  synthesizers  to  assure  that  such  output  would  meet 
its  needs.  The  Coast  Guard  should,  if  possible,  check 
the quality of synthesized broadcasts to see if they
are  acoustically  acceptable  after  they  have  been 
broadcast  over  Coast  Guard  radio  channels.  This  would 
help  to  ensure  their  overall  acceptability  for  Coast 
Guard  purposes. 

4) Whatever speech synthesis strategy the Coast Guard
adopts, it should not require elaborate operator
training, so as to save manpower.

5)  Any  synthesizer  system  considered  should  be  very 
time-efficient,  having  the  capability  to  output  desired 
broadcasts  instantaneously  from  teletype  messages 
without  long  turnaround  times. 

6) Any synthesizer system considered by the Coast Guard
should be highly reliable, with a backup system, if
possible. Since currently available synthesizers rely
upon LSI circuitry and contain no moving parts, they
are typically very reliable.

7) The Coast Guard should investigate fully just what
software will be required for any synthesizer system it
might wish to consider and see that this synthesizer
would fit in with its overall system requirements,
including host computer interfacing, programming
languages, etc. All types of speech synthesizers have
their advantages and disadvantages. The Coast Guard
should carefully weigh these before opting for any
particular type of synthesizer system.
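
Assumption 2 above rests on the ability of a rule synthesizer
to build new vocabulary items out of phonemes. The following
Python sketch shows only that bookkeeping step: a new word is
entered as a phoneme string and can then be concatenated with
existing words. The ARPAbet-style phoneme symbols, the small
lexicon, and the example storm name are illustrative
assumptions; an actual rule synthesizer would also apply
stress and intonation rules before producing audio.

```python
# Illustrative phoneme inventory (ARPAbet-style symbols, assumed here).
PHONEME_INVENTORY = {
    "S", "T", "AO", "R", "M", "G", "EY", "L", "AE", "AH", "N",
}

# Words already known to the hypothetical synthesizer.
LEXICON = {
    "storm": ["S", "T", "AO", "R", "M"],
    "gale": ["G", "EY", "L"],
}


def add_word(lexicon, word, phonemes):
    """Add a new word, given as a phoneme sequence, to the lexicon."""
    unknown = [p for p in phonemes if p not in PHONEME_INVENTORY]
    if unknown:
        raise ValueError(f"unknown phonemes: {unknown}")
    lexicon[word.lower()] = list(phonemes)


def to_phonemes(lexicon, text):
    """Concatenate the phoneme strings of each word in the text."""
    sequence = []
    for word in text.lower().split():
        sequence.extend(lexicon[word])  # KeyError if the word is not yet defined
    return sequence


# A new storm name is added without any new recording or hardware change.
add_word(LEXICON, "allen", ["AE", "L", "AH", "N"])
sequence = to_phonemes(LEXICON, "storm allen")
print(sequence)
```

By contrast with a stored-word vocabulary, the only cost of a
new item in this scheme is typing its phoneme spelling.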

In conclusion, SCRL is optimistic about the Coast
Guard's use of speech synthesis strategies in meeting its
broadcast requirements. Such synthesizers should have
several advantages over currently used broadcast techniques;
not the least of these advantages is the relatively low
price of several commercially available speech synthesizers.
Finally, SCRL stresses the point that speech synthesis
presents one option by which the Coast Guard should be
able to save manpower in meeting its broadcast requirements,
and maintain or increase the quality of Coast Guard
broadcasts.


CHAPTER  3 


SIGNAL ANALYSIS OF SELECTED COAST GUARD BROADCASTS

This chapter contains an acoustic evaluation of the
sample Coast Guard broadcasts which were supplied to SCRL
for analysis. There were 15 sample broadcasts contained on
the analog tape supplied to SCRL by the Coast Guard. The
tape was recorded at the U.S. Coast Guard Communications
Station, Honolulu, in late 1980. The sample broadcasts were
arbitrarily selected as typical received signals that
radiomen felt represented the range from poor to excellent
signals. These sample broadcasts were input to several
acoustic  analyses.  These  analyses  were  designed  to  help 
establish  how  the  Coast  Guard  broadcasts  might  interact  with 
speech  input/output  technologies.  In  particular,  we  wanted 
to  determine  how  well  Coast  Guard  broadcasts  might  interact 
with  speech  recognition  technologies  which  are  either 
currently  available  or  under  development. 

SCRL used its Interactive Laboratory System (ILS) for
analysis of the acoustic characteristics of Coast Guard
broadcasts. SCRL's ILS system operates on a DEC PDP-11/45
computer, with an RSX-11D operating system. The basic
procedure for acoustic analysis of Coast Guard signals was
to digitize them from analog tape and perform acoustic
analyses upon these digitized waveforms. Several steps
were carried out for the acoustic analysis of
Coast Guard signals. These included:

1) Analysis of pitch and RMS amplitudes of signals.
Such an analysis was carried out with in-house
software. The program provided both a display of pitch
contours and RMS amplitudes of input waveforms, and a
hardcopy which was output by SCRL's lineprinter, as
shown in Figure 1. The primary purpose of this
analysis was to obtain numbers which would allow us to
compute the signal-to-noise ratio for selected Coast
Guard signals. Such a computation is an important
measure for a preliminary analysis of how Coast Guard
signals might interact with speech recognition
technology, where this is a primary consideration.

2) Spectral analysis of Coast Guard broadcasts, which
was carried out using ILS software. Specifically, ILS
was used to compute inverse filter coefficients on
digitized signals. Following this, the spectral
content of specific frames of input data was displayed
and copied on SCRL's hardcopy unit. The purpose behind
this approach was to allow for analysis of the spectral
content of Coast Guard signals, as shown in Figure 2.
Specifically, SCRL was interested in determining what
the cutoff frequencies were for Coast Guard broadcasts,
so that we would be able to arrive at a better
understanding of how Coast Guard broadcasts might
interact with speech recognition technology.

Note that most speech recognizers input data from
approximately 500 Hz up to approximately 5-6 kHz. Thus, we
wanted to determine whether or not Coast Guard signals were
within this range, and how much background noise these
broadcasts generally contained.

SCRL computed signal-to-noise ratios from RMS amplitude
measurements carried out with SCRL's in-house
autocorrelation pitch extraction program. Table 1 has
signal-to-noise ratios computed for sample Coast Guard
broadcasts analyzed by SCRL.
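
The report does not state the formula used to turn RMS
amplitudes into a signal-to-noise ratio. A conventional
definition, assumed in the Python sketch below, is 20 times
the base-10 logarithm of the ratio of the RMS amplitude of a
speech segment to the RMS amplitude of a noise-only segment.
The sample values and function names are hypothetical and are
not drawn from the Coast Guard tape.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of waveform samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_samples, noise_samples):
    """Signal-to-noise ratio in dB from RMS amplitudes (assumed definition)."""
    return 20.0 * math.log10(rms(speech_samples) / rms(noise_samples))


# Hypothetical digitized segments: one taken during speech, one taken
# during a pause where only channel noise is present.
speech = [1000, -1000, 1000, -1000]
noise = [100, -100, 100, -100]

# An amplitude ratio of 10 corresponds to 20 dB.
print(round(snr_db(speech, noise), 2))
```

In practice the "speech" segment also contains noise, so a
measurement of this kind slightly overstates the true ratio
at low signal levels.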

TABLE 1

Signal-to-Noise Ratios for 15 Sample Coast Guard Broadcasts

                Signal-to-Noise    Approximate
Broadcast #     Ratio (dB)         Cutoff Frequencies

    1.            21.87            400-4000 Hz
    2.            13.12            300-3500 Hz
    3.            18.00            500-4000 Hz
    4.            24.74            400-4000 Hz
    5.            26.84            300-3500 Hz
    6.            25.75            600-3000 Hz
    7.            24.90            400-3500 Hz
    8.            28.46            400-3500 Hz
    9.            30.69            500-3500 Hz
   10.            29.29            700-4000 Hz
   11.            15.80            700-3500 Hz
   12.            23.72            500-3500 Hz
   13.            18.15            500-4000 Hz
   14.            20.10            500-4000 Hz
   15.            21.26            400-3500 Hz

   Mean           22.85
   Variance        5.13

Descriptive statistics were computed on this corpus of
data. It was noted that the 15 sample Coast Guard
broadcasts examined had a mean signal-to-noise ratio of
23 dB, with a standard deviation of 5 dB. With regard to
cutoff frequencies, it was noticed that Coast Guard
broadcasts fell generally within the 300-4000 Hz range.
Approximate cutoff frequencies were used to compute signal
bandwidths. The term "bandwidth" is generally used to
describe signals which are confined to a distinct region of
the frequency spectrum. Bandwidth is a useful term in the
present context, since it can be used to describe the width
of Coast Guard signals, in terms of their Hertz range.
Sample Coast Guard broadcasts evidenced a mean bandwidth of
3207 Hertz, with a standard deviation of 315 Hertz.
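
The summary figures can be rechecked from the Table 1 entries
with a short Python sketch. It uses the standard library's
statistics module and treats bandwidth as upper cutoff minus
lower cutoff, which is an assumption about how the report
derived it. On these values, the 5.13 listed at the foot of
Table 1 matches the sample standard deviation (n - 1
divisor), consistent with the 5 dB standard deviation quoted
in the text. The bandwidth figures computed from the rounded
cutoffs come out near, but not exactly at, the reported 3207
and 315 Hertz.

```python
import statistics

# Signal-to-noise ratios (dB) for broadcasts 1-15, from Table 1.
SNR_DB = [21.87, 13.12, 18.00, 24.74, 26.84, 25.75, 24.90, 28.46,
          30.69, 29.29, 15.80, 23.72, 18.15, 20.10, 21.26]

# Approximate (lower, upper) cutoff frequencies in Hz, from Table 1.
CUTOFFS_HZ = [(400, 4000), (300, 3500), (500, 4000), (400, 4000),
              (300, 3500), (600, 3000), (400, 3500), (400, 3500),
              (500, 3500), (700, 4000), (700, 3500), (500, 3500),
              (500, 4000), (500, 4000), (400, 3500)]

mean_snr = statistics.mean(SNR_DB)
sd_snr = statistics.stdev(SNR_DB)  # sample standard deviation (n - 1)

# Bandwidth taken as upper cutoff minus lower cutoff (assumed definition).
bandwidths = [upper - lower for lower, upper in CUTOFFS_HZ]
mean_bw = statistics.mean(bandwidths)
sd_bw = statistics.stdev(bandwidths)

print(f"SNR: mean {mean_snr:.2f} dB, standard deviation {sd_snr:.2f} dB")
print(f"Bandwidth: mean {mean_bw:.0f} Hz, standard deviation {sd_bw:.0f} Hz")
```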

In  view  of  the  poor  signal-to-noise  ratio  which  was 
inherent  in  the  Coast  Guard  recordings,  it  was  not  practical 
to  do  further  acoustic  analysis  of  the  speech  signal.  It  can 
be  assumed  that  identification  of  phonetic  and  prosodic 
details  in  these  recordings  would  be  difficult  for  most 
analysis techniques. The poor signal-to-noise ratio
inherent in Coast Guard messages received would drastically
limit the use of speech recognition systems for potential
applications, such as word spotting. The transmission of
synthetic speech should be well tested to be certain
acceptable quality is maintained regardless of the
signal-to-noise ratio.


CHAPTER 4


OVERVIEW  OF  SPEECH  RECOGNITION  PRODUCTS  AND  TECHNOLOGY 

This  chapter  of  the  report  gives  an  overview  of  the 
speech  recognition  products  reviewed.  Basically,  SCRL  notes 
that nearly all manufacturers listed recognition accuracy
rates in the vicinity of 99% for their recognizers.
These figures must be approached with some
caution, since there is no established vocabulary which
has been consistently used to compare recognizers.
Potential users of a speech recognition system should decide
what vocabulary they might wish to use with it and actually
try this vocabulary out in the field under real test
conditions.

Naturally,  some  vocabularies  are  much  easier  for 
recognizers  than  others,  and  results  vary  accordingly.  For 
example,  where  vocabularies  contain  words  with  very  similar 
phonemes,  such  as  "right"  and  "ripe",  items  can  be  confused. 
Also note that different speakers may influence test
results, depending on their cooperativeness in training and
using the system. Background noise can affect accuracy
results, particularly if the noise contains nonperiodic
sounds. Variables relating to recognition accuracy rates
should be identified in advance, and their potential impact
upon recognition accuracy should be fully considered before
actually purchasing any particular system. Most
manufacturers are very cooperative in arranging
demonstrations and evaluations of their speech recognition
devices.
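
One way to identify a vocabulary-related variable in advance
is to screen a candidate vocabulary for words whose phoneme
strings are nearly identical, such as "right" and "ripe". The
Python sketch below flags word pairs whose phoneme sequences
differ by at most one substitution, insertion, or deletion.
The phoneme transcriptions (ARPAbet-style) and the distance
threshold are illustrative assumptions; this is a screening
aid, not a prediction of how any particular recognizer will
behave.

```python
from itertools import combinations

# Hypothetical candidate vocabulary with illustrative phoneme spellings.
PRONUNCIATIONS = {
    "right": ["R", "AY", "T"],
    "ripe": ["R", "AY", "P"],
    "mayday": ["M", "EY", "D", "EY"],
    "stop": ["S", "T", "AA", "P"],
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    previous = list(range(len(b) + 1))
    for i, phone_a in enumerate(a, 1):
        current = [i]
        for j, phone_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (phone_a != phone_b),  # substitution
            ))
        previous = current
    return previous[-1]


def confusable_pairs(pronunciations, max_distance=1):
    """Return word pairs whose phoneme strings are within max_distance."""
    return [
        (w1, w2)
        for w1, w2 in combinations(sorted(pronunciations), 2)
        if edit_distance(pronunciations[w1], pronunciations[w2]) <= max_distance
    ]


pairs = confusable_pairs(PRONUNCIATIONS)
print(pairs)
```

Pairs flagged this way are candidates for replacement with
more distinct wording before field testing.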

The bar graph in Figure 3 shows general prices of
various recognition systems manufactured by Nippon Electric
Company, Threshold Technology, Interstate Electronics,
Votan, Heuristics, Centigram, Auricle, Scott Instruments,
and Voicetek.

This bar graph basically includes top-of-the-line
recognition  systems  for  ease  of  comparison.  As  can  be  seen, 
prices  for  recognition  systems  vary  from  several  hundred 
dollars  up  to  approximately  $13,000.  A  main  point  with 
respect  to  the  bar  graph  of  prices  is  that,  in  general,  the 
higher  the  price,  the  larger  the  recognition  vocabulary,  and 
the  faster  words  may  be  input  to  the  system.  Subsections  of 
this  report  detail  the  advantages  and  disadvantages  of 
various  systems;  these  should  be  considered  before  selecting 
any  particular  system. 

As was noted previously, the Coast Guard apparently
requires a completely speaker-independent recognizer for use
in spotting keywords in distress signals of connected
speech. As this report notes, there is no such system
currently available. All recognizers which were evaluated
require training by specific speakers. Interstate
Electronics does market a speaker-independent recognition
chip which has a claimed accuracy rate of approximately 85%
for the general population, using digits only. We single
out  for  Coast  Guard  consideration  the  Verbex  1800 
recognizer,  which  handles  up  to  50  isolated  words  plus 
digits  and  control  words,  in  a  speaker-independent  fashion. 
In  fact,  we  suggest  the  possibility  of  using  this  recognizer 
to  handle  keywords  which  might  reasonably  be  assumed  to  be 
somewhat  isolated,  or  spoken  at  a  fairly  slow  rate. 

There  is  a  continued  effort  among  manufacturers  to 
develop a completely speaker-independent voice recognition
system.  However,  it  should  be  two  or  three  years  before 
anyone  succeeds  in  marketing  such  a  system  which  is  capable 
of  meeting  Coast  Guard  needs  in  the  area  of  keyword  spotting 
in  distress  signals,  using  connected  speech.  There  are  many 
technical  problems  associated  with  developing  such  a  system. 
Obviously,  different  speakers  can  have  quite  different 
acoustic  manifestations  for  particular  phonemes, 
coarticulations,  overall  accents,  etc.  Another  obvious 
problem  is  the  difficulty  in  handling  speech  recognition 
using  connected  speech,  where  words  may  be  quite  different 
acoustically  than  their  isolated  acoustic  forms  and 
segmentation  of  word  boundaries  is  seldom  clear. 
Nonetheless,  manufacturers  are  competitively  developing 
speaker-independent  recognizers  which  can  handle  connected 
speech input. Advances will surely be made in this area
within the near future. As we have stated, we expect that
there may be a commercially available, speaker-independent
recognizer  capable  of  handling  a  relatively  large  vocabulary 
(approximately  100  words)  in  connected  speech  in 
approximately  two  or  three  years. 

4.1  AVAILABLE  SPEECH  RECOGNIZERS 

Data concerning speech recognizers were basically
compiled during the period of 1981 and 1982 from
manufacturers' specification sheets and brochures, as well
as from direct input from manufacturers and published
articles relating to speech input devices. Thus, much of
our  information  came  directly  from  information  we  requested 
from  manufacturers  of  voice  recognition  systems.  It  should 
be  noted  that  new  product  lines  may  have  been  introduced 
since  the  original  collection  of  the  data. 

There  is  an  increasing  variety  of  speech  recognition 
products  being  marketed  by  different  companies  for  various 
applications.  Price  and  performance  parameters  of  these 
different  devices  vary  considerably  depending  on  the  level 
of  the  products:  chip  level,  board  level,  or  system  level. 
They  also  vary  depending  on  the  type  of  recognizer: 
isolated  word  recognizer  vs.  connected  speech  recognizer  and 
speaker-dependent recognizer vs. speaker-independent
recognizer. The information given below is from nine
manufacturers of speech recognizers: 1) Centigram,
2) Heuristics, 3) Interstate Electronics, 4) Nippon Electric
Company, 5) Scott Instruments, 6) Threshold Technology,
7) Verbex, 8) Voicetek, and 9) Votan. The information received
from each manufacturer was not always complete or of the
same type as the material given by other manufacturers.
Consequently,  it  is  difficult  to  provide  comparative  data 
for  all  speech  recognition  systems. 


Centigram 

Centigram markets speech recognition products in the
systems and board level categories. Their Mike recognizer
is basically a speech recognition and response terminal.
The Mike is noted to have two basic operating processes.
Voice input is received through a recognition process for
training reference patterns or performing recognition. The
results of recognition are transferred to a host computer
through the input/output interface. For recording audio
response messages, voice input is received through a
response process, with the unit synthesizing recorded
response messages on command from the host computer.

Table 2 lists the main features of the Centigram Mike
along with operating specifications.

The Mike uses a spectrum analysis approach, where input
waveform data are either stored as templates or compared to
existing templates. Note that direct spectrum analyzer
output is available with the Mike unit.

TABLE 2

Centigram Mike

Features

• Recognition vocabulary of 99 words in any language
• Vocabulary mask command to specify a subset of active
  words for recognition
• User-adjustable accept/reject thresholds from local
  keyboard
• Recognition word display on front panel
• Automatic training sequence to simplify generation of
  reference patterns
• Available as stand-alone terminal or in electronics-only
  board configuration
• Stores up to 16 seconds of audio response
• Powerful system software commands operating in ASCII
  character-oriented protocol
• Direct spectrum analyzer output available
• Word framing and recognition parameters adjustable
  through system software commands

About Centigram

Centigram Corporation is the total solutions company in
the field of digital voice technology for computers and
communications. Centigram's state-of-the-art products cover
the full spectrum of man-machine communication: MIKE listens
(voice in), LISA talks (voice out), and VOPAC communicates
(voice transmission).

Specifications

Stand-Alone Unit
  Packaged with power supply, keyboard, and I/O interface;
    voice response optional
  Interface:  Both 8-bit parallel and RS-232C serial included
  Microphone input:  Balanced, 1000 ohms
  External speaker output:  8 ohms, 1 W; 1/4-inch phone jack

Recognition Electronics Only
  Single card; I/O interface and voice response optional
  Interface:  8-bit bidirectional unbuffered data bus
    with four decoded I/O address strobes
  Microphone input:  Same as for stand-alone unit

Warranty:  1 year

[Dimensions, weight, power, telephone, and telex entries are
illegible in the source copy.]

Specifications subject to change without notice.

Centigram Corporation, 155A Moffett Park Drive,
Sunnyvale, CA 94086

Recognition and voice response are two different
procedures, or sets of algorithms, as the Mike unit will
provide up to 16 seconds of audio response, but will provide
a recognition vocabulary of 99 words (in any language).
Centigram also notes that recognition electronics are
available on a single card, with voice response being an
option. Also available is the complete stand-alone unit,
with voice response also an option.

The Mike terminal sells for $4,765; the unit includes
the Mike terminal, head microphone, and recognition support
software. The Mike recognizer can be used in conjunction
with Centigram's Lisa speech synthesizer (described in the
speech synthesis products section).

Heuristics

Heuristics  markets  speech  recognition  products  which 
are  in  two  main  areas:  systems  level  products  (terminals) 
and  board  level  products.  Heuristics'  terminal  products  are 
moderately priced, ranging from approximately $4,600 to
$5,000.

SCRL received from Heuristics descriptive brochures for
two main speech recognition systems. First, Heuristics
markets the 5000 Series product line. This product line is
headed by the 5600 Voice Terminal System. This system
features a Lear Siegler terminal, a voice controller board
with a 128-word vocabulary, disk drive and cables, and a
noise-cancelling microphone. The 5600 system has a list
price of $4,995. Second, Heuristics markets the 7000 Series
product line. This system does not include a terminal. It
consists of a voice controller board (also with a 128-word
vocabulary), disk drive and cables, and a noise-cancelling
microphone. It has a list price of $4,595. Note that these
two systems may be purchased in part or in whole, as needed.

The  Heuristics  5000  Series  is  described  as  a  stand 
alone  intelligent  data  entry  device.  The  unit  utilizes  a 
spectrum  analyzer  using  digital  filtering  techniques  to 
analyze  audio  input  and  to  convert  it  to  a  digital 
representation. Heuristics states that proprietary
algorithms transform the data into a reference template.
Reference templates, obtained during training, are compared
with those from audio input. When a match occurs, a
user-defined  ASCII  string  is  sent  to  the  host  and/or  the 
terminal. 
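
The general template-matching idea can be sketched in Python
as follows. Heuristics' algorithms are proprietary and are not
described in its literature, so everything specific here is an
assumption made for illustration: the three-element feature
vectors, the Euclidean distance measure, the reject threshold,
and the ASCII strings.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(features, templates, responses, reject_threshold):
    """Return the ASCII string for the closest template, or None if rejected."""
    best_word, best_distance = None, float("inf")
    for word, template in templates.items():
        d = distance(features, template)
        if d < best_distance:
            best_word, best_distance = word, d
    if best_distance > reject_threshold:
        return None  # no template is close enough: reject the utterance
    return responses[best_word]


# Hypothetical reference templates stored during training.
TEMPLATES = {"start": [0.9, 0.1, 0.3], "stop": [0.2, 0.8, 0.5]}

# User-defined ASCII strings sent to the host when a word is matched.
RESPONSES = {"start": "RUN\r", "stop": "HALT\r"}

accepted = recognize([0.85, 0.15, 0.3], TEMPLATES, RESPONSES, reject_threshold=0.25)
rejected = recognize([0.5, 0.5, 0.9], TEMPLATES, RESPONSES, reject_threshold=0.25)
print(repr(accepted), repr(rejected))
```

The reject threshold plays the role of the adjustable
accept/reject setting that several manufacturers list among
their features: raising it accepts more utterances at the
cost of more false matches.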

The following is a list of features of the 5000 Series
recognizer:

1) 128 word/phrase vocabulary

2) each word/phrase may be up to 3 seconds in duration

3) 127 user-definable vocabulary subsets

4) ASCII strings up to 255 characters in length

5) local storage eliminates the need for host
programming

6) simultaneous use of voice and key entry

7) device listens continuously through wraparound
speech buffer

8) compatible with several languages, including
FORTRAN, COBOL, PASCAL, and BASIC

9) automatic self test and fault isolation

10) RS 232C (20 mA current loop) serial interface

11) 50 - 9600 Baud

12) high level auxiliary input for telephone or tape
recorder (1 VRMS)

13) single board unit easily installed in Lear Siegler
terminals

Table  3  provides  a  listing  of  Heuristics  operation 
specifications  for  the  Series  5000. 

The  Heuristics  Series  7000  speech  recognizer  is  similar 
to the Series 5000 device, but does not include a computer
terminal. It is designed for use in conjunction with any
ASCII terminal. This recognizer utilizes the same basic
recognition  approach  as  the  Heuristics  5000  Series  units. 
It  also  shares  the  same  operation  specifications  with  the 
5000  Series  units,  as  listed  in  Table  3. 


Interstate  Electronics 

Interstate  Electronics  is  one  of  the  oldest  companies 
involved  in  speech  recognition,  and  currently  manufactures  a 
rather  wide  array  of  speech  recognizers  and  related 
products. 




TABT.K  3 


H  n  u  r  J  5 i  f  J  <  ■  s 


Sr' 


r 


) 


(.'S 


r,0()(l 


AN INTELLIGENT DATA ENTRY DEVICE

Heuristics Series 5000 is a stand-alone intelligent
speech data entry device used in conjunction with
the Lear Siegler ADM 3 and 5 terminals. It completes
the man-machine interface through speech, the most
natural form of communication.
It's natural, it's simple, it's fast, it's accurate.
Each unit has a spectrum analyzer that uses
state-of-the-art digital filtering techniques to analyze
audio input and convert it to a digital representation.
Heuristics proprietary algorithms transform this
data into a compact characteristic reference template.
These templates, obtained during training, are
compared during recognition with the audio input.
When a match occurs a user-defined ASCII string
is sent to the host and/or the terminal.

NO HIDDEN COSTS OF SOFTWARE DEVELOPMENT

Through the use of local storage media on the disk
drive models there is no hidden cost of software
development because all systems integration is
eliminated. Local storage eliminates the need to
write code to save and restore vocabularies in the
host computer.

The user has complete flexibility in defining the
characters to be associated with each utterance.
An ASCII string up to 255 characters in length can
be assigned to any utterance during training. Once
again, this eliminates hidden software costs by
removing the need for creating look-up tables in the
host computer.

FEATURING

• 128 word/phrase vocabulary

• Each word/phrase up to three seconds in
length

• 127 user definable vocabulary subsets

• ASCII strings up to 255 characters in length

• Local storage eliminates the need for host
programming

• Simultaneous use of voice and key entry

• Listens continuously through wraparound
speech buffer

• Compatible with all languages including
FORTRAN, COBOL, PASCAL, BASIC

• Automatic self test and fault isolation

• RS 232 C (20 mA current loop) Serial interface

• 50 to 9600 Baud

• High level auxiliary input for telephone or
tape recorder (1 VRMS)

• Single board unit easily installed in LSI
ADM 3 and 5 terminals

IDEAL FOR HANDS BUSY/EYES BUSY
APPLICATIONS

With Heuristics, the first company to develop board
level and completely self-contained speech recognition
terminals, you can instruct your computer
verbally, freeing your hands and eyes for other
tasks. It's faster, simpler and t83% more accurate
than manual entry.


SERIES 5000 PRODUCTS

MOO VOICE TERMINAL SYSTEM
    Lear Siegler ADM 5 Terminal
    Voice Controller Board
    with 128 word vocabulary
    Disk Drive and Cables
    Noise Cancelling Microphone
5400 VOICE TERMINAL SUBSYSTEM
    Voice Controller Board
    with 128 word vocabulary
    Disk Drive and Cables
    Noise Cancelling Microphone
5300 VOICE TERMINAL
    Lear Siegler ADM-5 Terminal
    Voice Terminal Board
    with 128 word vocabulary
    Noise Cancelling Microphone
5200U DISK DRIVE
    Cables and Disk Controller for
    upgrading model 5000 to model 5400
5000 VOICE TERMINAL BOARD
    with 128 word vocabulary
    Noise Cancelling Microphone


Typical  applications  using  speech  recognition 
today  are 

•  Process  Control 

•  inventory  data  entry  or  inquiry 

•  Word  processing  terminal  control 

•  Credit  Verification 

•  Quality  Control  and  inspection 

•  Automated  test  equipment  control 

•  Hospital  room  control 

•  Source  entry  of  measurement  data 

•  Executive  data  base  inquiry 

•  Computer  control  for  the  handicapped 

•  Automated  microscope  control 


SPECIFICATIONS 
Terminal  Disk  Drive 

Environmental 

Ambient  Temperature 

Operating  0*Cto50*C  4*C«o4®*C 

Storage  40*Cto65*C  22*Cto47*C 

Humidify  T0%to90%  20%  to  80% 

Physical Dimensions

Width    15.60 in (39.60 cm)    6.00 in (15.24 cm)

Depth    20.20 in (51.30 cm)    13.00 in (33.02 cm)

Height   13.50 in (34.30 cm)    3.75 in (9.53 cm)

Weight   34.42 lbs (15.64 kg)   8.50 lbs (3.85 kg)

Electrical 

ah  power  supplied 

through  ADM  Terminal  ii5V  230V  t  10% 
115V  230V  t  10%  50/60  Hz 

50/60  Hz  15  Watts 

50  Watts 

Audio Input

Low level: 5 mV RMS, 600 ohm impedance
Connector: B3F Switchcraft 3 pin female
High level: 1 volt RMS, 1000 ohm impedance
Connector: 1/4 inch phone jack
RS 232 C Connectors: DB 25
Recognition Rate: 99+ percent
Warranty: One year for all parts and labor
All specifications subject to change without notice



HEURISTICS


CORPORATE OFFICE
1285 HAMMERWOOD AVENUE
SUNNYVALE CALIFORNIA 94086
406/734-8532 TWX 172180


EASTERN REGIONAL OFFICE
i»5  MAIN  STREET 

PORT  WASHINGTON  NEW  YORK  11050 
516/944 7575  TWX  $49233 
Interstate's speech recognition products are broadly
divided into three areas: 1) terminal products, 2) board
products, and 3) voice semiconductor products.

Interstate Electronics markets two basic voice entry
terminals. One of these is the VRT101 voice recognition
terminal. This unit has a 100 word recognition vocabulary,
and a claimed accuracy rate of 99%. The unit includes a Z80
microprocessor, 48K of memory, and a 100K floppy diskette
drive. The VRT101 sells for $5,295, with quantity discounts
available. Interstate also markets the VRT103 fully
integrated voice recognition terminal, which is the same as
the VRT101, but includes two additional floppy diskette
drives with a 300K memory capacity. The VRT103 sells for
$6,595 in single units. Tables 4 and 5, which follow,
cover the specifications of the VRT101 unit. It should
be noted that a wide variety of options is available for
Interstate's speech recognition terminals.

Interstate's board products include:

1) The VRM041 voice recognition module. This unit
includes a 40 word voice recognition vocabulary with
99% accuracy, and RS232-C interfacing. The VRM041
sells for $1,790 in single units.

2) The VRM102 voice recognition module. The unit
features a 100 word vocabulary with 99% accuracy, and
2 serial RS232-C or 20-mA asynchronous interfaces.
It sells for $2,255 in single units.




3) The VRT200 voice recognition module. The unit
permits voice recognition with the popular Lear
Siegler ADM 3A and ADM 5 terminals. The unit
features a maximum of 100 isolated words or phrases
and recognition accuracy of 99% or better. The VRT200
also includes a user programmable reject level.
The VRT200 sells for $2,100 in single units.

Tables  6  and  7  provide  a  listing  of  specifications  for 
the  Interstate  VRT200  voice  recognition  terminal. 

Table 8 indicates specifications for the Interstate
VRM041 and VRM102 voice recognition modules.

Interstate markets a single-board speech recognition
module for DEC Q-Bus equipment, called the VRQ400 voice
recognition module. This board costs $2,120. It features a
100 word recognition vocabulary, trainable for any
vocabulary in any spoken language. The unit is usable with
direct microphones, wireless microphones, or via telephone.

A new product from Interstate is the VRC008 voice
recognition chip. This chip features an 8-word vocabulary
and is speaker-independent. An accuracy rate of 85% is
given for the general population. The chip carries an initial
charge of $25,000 for tooling and mask generation. After
this initial tooling charge, the VRC008 sells for $22.50,
with discounts for quantities over 25,000 units. Table 9
provides a listing of specifics regarding the VRC008.
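
As a rough check on what this pricing implies, the sketch below spreads the one-time tooling charge over a production run. It uses only the two figures quoted above ($25,000 tooling and $22.50 per chip). The function name is illustrative, and because no discount schedule is given for quantities over 25,000 units, discounts are ignored; results at or beyond that quantity should be read as upper bounds.

```python
TOOLING_CHARGE = 25_000.00   # one-time tooling and mask generation charge, dollars
UNIT_PRICE = 22.50           # price per VRC008 after tooling, dollars


def effective_unit_cost(quantity: int) -> float:
    """Average cost per chip once the tooling charge is spread over the run.

    Assumption: no quantity discount is applied, because the text gives no
    discount schedule for orders over 25,000 units.
    """
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return UNIT_PRICE + TOOLING_CHARGE / quantity


if __name__ == "__main__":
    for qty in (100, 1_000, 10_000, 25_000):
        print(f"{qty:>6} units: ${effective_unit_cost(qty):,.2f} per chip")
```

At small quantities the tooling charge dominates the per-chip cost, which is consistent with the chip being aimed at high-volume products.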
TABLE 4


Interstate Electronics VRT101 Voice Recognition Terminal


INTERSTATE 

ELECTRONICS 

CORPORATION 


VOICE  RECOGNITION  TERMINAL 

Model  VRT101 


Voice recognition is fully integrated into Interstate's diskette-
based VRT101 intelligent voice terminal.


•  Direct  voice  interaction  with  application 
software 

•  Supports  a  variety  of  application  programs 
and  higher  level  languages 

• 100-word resident vocabulary

•  99%+  accuracy 

•  Utility  software  for  immediate  use  of  voice 
recognition  functions 

•  Self-test  for  fault  isolation 


MODEL VRT101 SPECIFICATIONS
CPU and Memory
Processor: Z80
Clock: 2.048 MHz
Memory: 48K bytes RAM


Display

CRT: 12 inches diagonal, P4 phosphor
Display Format: 24 lines of 80 characters plus 25th user-
status line
Display Size: 6.5 inches high x 8.5 inches wide
Character Size: 0.2 inch high x 0.1 inch wide (approximate)
Character Set: 128 (95 ASCII plus 33 graphics)
Character Type: 5 x 7 dot matrix (upper case), 5 x 9 dot
matrix (lower case with descenders)
Keyboard: 72 keys (60 alphanumeric, 12 function control)
plus a 12-key numeric pad
Cursor: Blinking or reverse video block, or off
Cursor Controls: Up, down, left, right, home, CR, LF, back
space and tab from keyboard or computer.
Cursor Addressing: Relative and direct.
Tab: Standard 8-column
Refresh Rate: 60 Hz at 60 Hz, 50 Hz at 50 Hz line frequency
Edit Functions: Insert and delete character or line
Erase Functions: Erase line from beginning of line to end of
line, erase page from beginning of page to end of page
Bell: Audible alarm on receipt of ASCII BEL
Video: Normal and reverse by character


Serial Input/Output Ports (2)

Interface: EIA RS-232C at data rates of 110 to 9600 bits/
second
Communication Mode: Full or half duplex
Parity: Even, odd, or none.

Disk Systems

Built-in: 5-1/4-inch floppy, 100K bytes
VRTDK2: Two external 5-1/4-inch floppies, 200K bytes.

Software

CP/M: Operating system software
BASIC: Microsoft
FORTRAN: Microsoft.


TABLE 5


Mechanical

Dimensions: 13 inches high x 17 inches wide x 20 inches
deep

Weight: 54 pounds.

Environmental

Operating Temperature: 10 to 35°C
Storage Temperature: 0 to 35°C.

Power

120/240 volts at 50/60 Hz at 90 watts maximum

Voice Recognition Performance
Vocabulary Size: 100 isolated words and/or phrases
Recognition Accuracy: 99+ percent
Reject Threshold: User selectable.

Longest Utterance Duration: 1.25 second
Minimum Between-Word Pauses: 160 milliseconds (user-
programmable: 40 to 320 milliseconds)

Minimum Word Length: 80 milliseconds (user-
programmable: 80 to 160 milliseconds).

Processing Time: (25 + N) milliseconds, where N = active
vocabulary size, following detection of the end of word
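
The minimum between-word pause and minimum word length parameters suggest how an isolated-word recognizer of this kind might separate utterances. The following is a minimal sketch under stated assumptions, not Interstate's algorithm: speech is assumed to arrive as per-frame energy values, and the 10 ms frame length, the energy threshold, and the function name are all hypothetical. Only the 160 ms and 80 ms defaults come from the specification above.

```python
from typing import List, Tuple


def find_words(frame_energy: List[float],
               frame_ms: int = 10,             # assumed frame length
               energy_threshold: float = 0.1,  # assumed speech/silence threshold
               min_pause_ms: int = 160,        # Minimum Between-Word Pauses
               min_word_ms: int = 80) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans judged to be isolated words."""
    # 1. Collect runs of consecutive frames at or above the energy threshold.
    runs: List[List[int]] = []
    start = None
    for i, energy in enumerate(frame_energy):
        if energy >= energy_threshold and start is None:
            start = i
        elif energy < energy_threshold and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(frame_energy)])

    # 2. A silence shorter than the minimum pause does not end the word,
    #    so runs separated by such a gap are merged.
    merged: List[List[int]] = []
    for run in runs:
        if merged and (run[0] - merged[-1][1]) * frame_ms < min_pause_ms:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Discard bursts shorter than the minimum word length (clicks, bumps).
    return [(s * frame_ms, e * frame_ms) for s, e in merged
            if (e - s) * frame_ms >= min_word_ms]
```

With these defaults, a 50 ms dip inside a word is bridged, while a 30 ms click surrounded by silence is discarded. Raising the pause setting toward 320 ms would tolerate slower speakers at the cost of a longer wait before a word is declared finished.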


Voice  Utility  Commands 

1. Train

2. Update

3. Reset

4. Set RTHL

5. Read RTHL

6. Download reference patterns only

7. Upload reference patterns only

8  Download  reference  patterns  and  ASCII  strings  |Oir : 

9. Recognition with common vocabulary

10. Recognition of non-contiguous vocabulary (host
mode only)

11. Test (standalone mode only)

12. Write word boundary parameters

13. Read word boundary parameters

14. Set Operational Mode (standalone = 0, host = 1)

15. Set Gain
16. Read Gain

17. Compare Reference Patterns
18. Self-test


CP/M is a trademark of the Digital Research Corporation


Interstate VRT200 Voice Recognition Terminal


INTERSTATE 

ELECTRONICS 

CORPORATION 


VOICE  RECOGNITION  TERMINAL 

Model  VRT200 


[Figure: Block Diagram of ADM 3A Terminal with VRT200 Voice
Recognition Capability]


Accurate, Low-Cost Automatic Speech Recognizer

The VRT200 is a single printed-circuit board speech
recognizer with a vocabulary of 100 words or short phrases
designed specifically for use in the Lear Siegler ADM 3A and
ADM 5 Dumb Terminal Video Displays. With 99+ percent
accuracy, the VRT200 allows Lear Siegler Dumb Terminal
users to input commands or data via voice and/or keyboard,
thus providing maximum operator efficiency for data entry,
retrieval, and log-on.

The VRT200 is a total hardware/software system designed for
easy installation without modification to existing application
software. All VRT200 logic is contained on a single
printed-circuit board that has been specifically designed to fit the
ADM 3A and ADM 5 and can be installed without soldering
or special tools. An ADM 3A or equivalent fitted with a
VRT200 board immediately adds voice input capability to
already operational data entry, process control, or
management information systems.

The VRT200 allows direct microphone input via either a
boom-mounted, lightweight, noise-cancelling microphone
or, at the user's option, a table-mounted microphone. The
microphone has a standard five-foot cord, but longer cables
are available if more freedom of movement is required. Using
the VRT200 frees the operator from the need to return to a
fixed workstation to enter data, thus increasing operator
efficiency.


Efficient, Real-Time Performance

Speech input is analyzed by a 16-channel spectrum analyzer
and converted to a digital representation of the spoken input.
This digital data is then converted to a fixed-size pattern that
preserves the information content of the spoken inputs while
discarding redundant features. During vocabulary training
these patterns are used to derive templates for each
utterance. The templates are then used in the recognition
process for comparison with incoming spoken words.
Vocabulary templates are stored in an onboard random-access
memory (RAM) while the processing algorithms are
contained in an onboard read-only memory (ROM) operating in
conjunction with a microprocessor. When an utterance is
recognized, a user-defined ASCII string is then sent to the
host.


• Single-board speech recognition module

• Adds voice input capability to ADM 3A and
ADM 5 Dumb Terminal Video Displays

• No special programming required

• 100-word vocabulary

• 99%+ accuracy

• Selectable decision threshold for rejection of
unwanted inputs

Trainable to Individual Voice Characteristics

The VRT200 is a speaker-dependent voice recognition
device, which requires that each user give a sample of the
words and phrases in the vocabulary before the VRT200 will
recognize the user's voice. The process of generating these
samples, or reference patterns, is called vocabulary training.
Once a reference pattern set has been built, it can be
uploaded to the host computer's mass storage. Later, the user
can download the reference patterns, allowing the terminal
to recognize the same words without the need for retraining.
With each VRT200, software is supplied in FORTRAN and
BASIC languages demonstrating the host-resident code
necessary to perform the upload/download operation.

The VRT200 supports three training modes: (1) normal
training, in which the vocabulary is cleared, then trained a selected
number of samples; (2) updating of word patterns, in which
the stored reference patterns for the specified vocabulary are
augmented by additional training; and (3) a single-word
retrain mode, in which the single word will be trained the
same number of samples as the word it is replacing.
Specifications


INTERSTATE 

ELECTRONICS 

CORPORATION 

.001  u m  Bai  RokI 

P0  Bo»  1117 

4nan#im  <  'aakxnia  *#2810 

’14M6  72: 0  .8001854-6979 

rwx  'ihtv#i  \vn 

Vie*  *5544.1  &  *55419 


•W*SferT  »  >V* 

.91V,  ca**  Bad  Road 
*nahe»n>  CaatumMl  92803 
~14i  M5  7210 


ADM 3A and ADM 5

CRT Screen:

12 inch (30.5 cm) diagonal, P4 phosphor,
non-glare surface. 5.8 inches
(14.7 cm) high x 8.3 inches (21.1 cm)
wide. Display: 1920 characters, 80
char/line by 24 lines


ADM 5: 128 ASCII characters, upper/
lower case, punctuation, control characters.
ADM 3A: 64 ASCII characters,
displayed as upper case plus punctuation
and control
Ckermimr  Fom. 

Character  mam*.  a£m  5-  -  •  •>  »  9  dot 
rnacru  including  Kill  l  dot  descenders 
i  1  68mm  wide  *  o  S3  mm  hxjii  ADM 
1A  —  5  *  7  Jot  matrix  •  1  88mm  »tde  * 

4  77  mm  I 


VRT200


Vocabulary Size:

Maximum 100 isolated words or
phrases

Percent Recognition Accuracy:

99 percent or better

Reject Threshold:

User programmable

Longest Utterance Duration:

1.25 seconds

Minimum Between-Word Pauses:

160 milliseconds


Microphones


VMK010:

Noise cancelling microphone.
Microphone mounted on aluminum
headset; includes ON-OFF switch for audio
input control


Sonhweasam  Office 
500  Airport  Bvd  Suita  110 
Burlingame  CA  94010 
-.15)  342  8624 
'  antral  Of&ca 

1375  Retrangaaci  Road.  Suits  M 
Schaumburg.  11.  60195 
■312!  843  7233 
Eastern  <.)ffica 

1745  ertfaraon  davtt  Hwy  Suita  601 
Arlmgsnn  VA  22202 
7031892  1400 


Reverse block. Homes to upper left of
screen. Optional switch selectable
underline cursor homes to bottom left.
Switch selectable non-destructive
space after carriage return

Visual Attributes:

Reverse video, reduced intensity, and
reverse video/reduced intensity
combination (ADM 5 only)

Keyboard:

ADM 5: 83 keys. 26 letter alphabet
with upper/lower case, numeric
keypad, punctuation, caps lock, cursor
control. All keys are auto repeating
(22 char/sec). ADM 3A: 59 keys.
26-letter alphabet with upper case,
numeric 0 through 9, punctuation,
control. Two key repeat operation,
22 char/sec


Minimum  Word  Length 

-hJ  mdbsaconds 

Approximate Response Time:

(25 + N) milliseconds, where N =
active vocabulary size (following end
of word detection)
Software:

Provides vocabulary reference pattern
upload/download for DEC RT-11,
DEC RSX-11M, Data General RDOS


VMK011:

Noise cancelling microphone with
earphone. Same as VMK010 except
includes earphone for voice response

VMK845:

Stand-mounted microphone. Cardioid
pickup pattern with gooseneck and
stand including ON-OFF switch


Lear Siegler, Inc.
Data Products Division

714 North Brookhurst Street
Anaheim, California 92803
(714) 774-1010

TWX 910-591-1157 Telex 65-5444


l S'nverxeoon  nvxle  *uU  tiait  duplex 
Interfaces:

RS-232C point to point or 20 mA
current loop

Data Rates:

75-19,200. Parity: even, odd, mark,
space or none

Word Structure:

Data 7 or 8 bits, 1 start bit, 1 or 2 stop
bits

Extension Port:

RS-232C port for interfacing with serial
asynchronous devices


Standard Power Requirements:

115 Volts ± 10%, 60 Hz, 60 watts

Optional Power Requirements:

230 Volts ± 10%, 50/60 Hz

Width:

12.5 inches

Height:

10.5 inches

Operating Environment:

5° to 50°C (41° to 122°F). 5% to 95%
relative humidity without condensation.
10,000' (3 km) max altitude


VMK577:

Hand-held microphone. For use in
applications where continual voice
entry is not required. Contains noise
cancelling element and PUSH-TO-
TALK switch


Boston  (6171  89a 7093 
Chicago  (312*  279  5250 
Houston  (713)  780  25KS 
L.«  Angelas  (213*  454-9941 
New  York  t8U0'  523  5253" 
Drtando  <3U5i  869  I82h 
Philadelphia  (215)  245-1520 
San  Francisco  t415l  828-b941 
Washington  DC  800)523  5253* 
England  (04867)  80666 

•HOI)  number  also  includes  CT  Dt. 
MA  MD  NJ  NY  R1  VA  &  WV 


TABLE 8


Interstate VRM041 and VRM102 Voice Recognition Modules


INTERSTATE 

ELECTRONICS 

CORPORATION 


VOICE  RECOGNITION  MODULE 
Models VRM041 and VRM102


Single-board Voice Recognition Module

Accurate, Low-Cost Automatic Speech Recognizer

The Voice Recognition Module (VRM) is a single printed-
circuit board speech recognizer capable of recognizing as
many as 100 words or short phrases. It is easily interfaced to
an external system using either parallel or serial interfaces.
The serial interfaces are switch selectable to RS232-C or 20-
mA current loop. The VRM includes all the logic and memory
necessary to perform training, word recognition, and the
communication protocol independent of the user's mode of
operation.

The VRM contains a microphone preamplifier and a
preamplifier bypass switch to allow direct microphone input
using a lightweight headset, boom-mounted, or hand-held
microphone. Alternately, an audio signal may bypass the
onboard preamplifier, which allows a remote microphone
and preamplifier to be utilized without the loss of audio
signal integrity. The input is AC-coupled and terminated by a
*0-kilohm resistance. The useful audio bandwidth of the
VRM is from 200 to 7000 Hz. Excellent recognition is
attainable with the reduced telephone bandwidths.

Highly Accurate Real-Time Operation

The input speech is analyzed by a 16-channel spectrum
analyzer and converted to a digital representation of the
characteristics of the spoken input. This digital data is then
converted to a fixed-size pattern that preserves the
information content of the spoken inputs while discarding redundant
features. During word training, these patterns are used to
derive templates for each vocabulary item. These templates
are used in the recognition process for comparison with
incoming spoken words. Vocabulary templates are stored in
an onboard random-access memory (RAM), while the
processing algorithms are contained in an onboard read-only
memory (ROM) operating in conjunction with a
microprocessor.


• 99%+ accuracy

• 40- and 100-word vocabularies

• Highly accurate real-time operation

• Trainable for any vocabulary in any spoken
language

• Multibus form factor

• Usable with direct microphones, wireless
microphones, or via telephone

• User selectable rejection of poor input match

• One parallel and two serial ASCII input/
output ports

• User control of recognition parameters

The VRM has two training modes: (1) normal training, in
which the vocabulary storage is cleared and a new
vocabulary is trained by speaking chosen words a selectable number
of times, and (2) updating of word patterns, in which the
stored reference patterns for the specified vocabulary are
augmented by additional training.

The VRM automatically rejects utterances during training that
do not sufficiently agree with the same utterance from
previous training samples of that word. This prevents significant
alteration of a vocabulary reference pattern due to spurious
noise (bumping the microphone, door closure, coughing,
speaking inconsistencies, or simply failing to utter the
vocabulary in the specified sequence). Thus, it may be necessary to
repeat an utterance before being prompted to the next
sequential utterance.
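
The training-time rejection described above can be illustrated with a small sketch. The data sheet does not say how agreement between samples is measured or how templates are combined, so everything specific here is an assumption: patterns are plain lists of numbers, agreement is Euclidean distance to the running mean of the accepted samples, and the class name and threshold are hypothetical.

```python
import math
from typing import List, Optional


class WordTemplate:
    """Reference pattern for one vocabulary word, built from accepted samples."""

    def __init__(self, max_distance: float) -> None:
        self.max_distance = max_distance          # hypothetical agreement threshold
        self.mean: Optional[List[float]] = None   # running mean of accepted samples
        self.count = 0

    def train(self, pattern: List[float]) -> bool:
        """Fold one training sample into the template.

        Returns False (leaving the template unchanged) when the sample
        disagrees too strongly with earlier samples of the same word,
        in which case the speaker would be asked to repeat it.
        """
        if self.mean is None:
            self.mean = list(pattern)
            self.count = 1
            return True
        if len(pattern) != len(self.mean):
            raise ValueError("pattern length does not match template")
        if math.dist(pattern, self.mean) > self.max_distance:
            return False
        self.count += 1
        # Incremental update of the mean with the newly accepted sample.
        self.mean = [m + (p - m) / self.count
                     for m, p in zip(self.mean, pattern)]
        return True
```

In this sketch the first sample is always accepted, so a cough on the very first repetition would still corrupt the template; a real trainer would need some additional safeguard for that case.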
VOICE RECOGNITION CHIP
Model VRC008
(advance information)

• Low-cost voice recognition

•  8-word  vocabulary 

•  Speaker  independent 

• 85% accuracy for general population

• Up to 95% accuracy for specific speakers

•  NMOS  and  CMOS  versions 

•  8  parallel  I/O  lines 

•  Single  +  5- volt  power  supply 

A Low-Cost 28-Pin, Single-Chip Voice Recognition
System

Interstate's single-chip VRC008 system employs a
unique method for processing of analog speech data and
recognition of spoken utterances.


Designed for a wide variety of high-volume consumer
applications, this microcomputer provides low-cost voice
control capability for appliances, toys, games, and other
voice automation products. The system is
speaker-independent and recognizes with high accuracy eight spoken words
or phrases, translating verbal commands (i.e., "walk," "stop,"
"channel four," "turn right," etc.) into action via associated
circuitry. In a typical application, "wake up" activates the
system into a receptive mode and prepares it to accept input
speech; the word "relax" stops the system.

Programmable for a selected vocabulary, the VRC008
recognizes speech by detecting the state sequence of the
voiced and unvoiced parameters in the incoming word or
phrase and comparing this sequence with the stored
sequence of a prespecified vocabulary. With recognition
accomplished, the system then outputs a bit pattern for the
word number identified. The state sequence and recognition
parameters are stored in the on-chip ROM.

Interstate customizes the VRC008 to specific user
vocabularies. In this process the customer defines the particular
functions to be performed by his product and IEC provides
assistance in selecting a vocabulary suited to those functions.
Interstate markets the VRC100-1 voice recognition chip
set. This chip set sells for $305. Basically, this set
consists of two chips to be used as building blocks for
speech recognition systems capable of recognizing as many as
100 words or short phrases. Tables 10 and 11 include a
diagram of a suggested setup using the VRC100-1 chip set.

The VRC100-1 chip set nicely typifies Interstate's
approach to speech recognition, which is briefly set forth
below.

Speech input is analyzed by a 16-channel spectrum
analyzer and converted to a digital representation of the
characteristics of the spoken input. The digital data are
then converted to fixed-size templates which preserve the
information content of the spoken input while discarding
redundant features. During training, stored patterns are
used to derive templates for each word pattern. These
templates are next used in the recognition process for
comparison with incoming speech templates. Presumably,
incoming templates are correlated with stored templates for
actual word recognition. Vocabulary templates are stored in
the external RAM, while processing algorithms are contained
within the speech analyzer device.
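
The comparison step described above can be sketched as a nearest-template search with a reject threshold. This is an illustration under stated assumptions rather than Interstate's method: the report only presumes that templates are correlated, so Euclidean distance is used here as a simple stand-in, and the function name, the threshold value, and the three-element example templates are hypothetical (a real template would summarize all 16 analyzer channels over time).

```python
import math
from typing import Dict, List, Optional


def recognize(pattern: List[float],
              templates: Dict[str, List[float]],
              reject_threshold: float) -> Optional[str]:
    """Return the vocabulary word whose template is closest to `pattern`.

    Returns None when even the best match is farther away than
    `reject_threshold`, which plays the role of the user-selectable
    reject level mentioned in the product descriptions.
    """
    best_word: Optional[str] = None
    best_distance = math.inf
    for word, template in templates.items():
        distance = math.dist(pattern, template)  # stand-in for correlation
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word if best_distance <= reject_threshold else None


if __name__ == "__main__":
    vocabulary = {
        "mayday": [0.9, 0.1, 0.4],
        "roger": [0.2, 0.8, 0.5],
    }
    print(recognize([0.85, 0.15, 0.4], vocabulary, reject_threshold=0.3))
    print(recognize([0.0, 0.0, 3.0], vocabulary, reject_threshold=0.3))
```

The search visits every stored template once, so its cost grows linearly with the active vocabulary, which fits the vocabulary-dependent processing times quoted in the specifications.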

As this report was in preparation, Interstate added
another voice recognition module to its voice input product
line. This is the VRT300, which is reported to have a 100
word vocabulary. The unit is designed to be a single
VOICE RECOGNITION CHIP SET
Model  VRC100-1 
(advance  information) 


•  Speech  recognition  chip  set 

• Highly accurate real-time operation (99%+)

• 100-word vocabulary

• Trainable for any vocabulary in any language

•  Two  training  modes  -  train  and  update  (all  or 
part  of  the  vocabulary) 

•  Usable  with  direct  microphone,  wireless 
communication,  or  telephone 

•  Selectable  decision  threshold  for  rejection  of 
unwanted  inputs 

•  Input/output  port  configuration  for  easy 
product  integration 

High-Accuracy Voice Recognition in a Chip Set

Interstate's Model VRC100-1 voice recognition chip set
consists of two chips used as building blocks for speech
recognition systems capable of recognizing as many as 100 words or
short phrases (see illustration). These two integrated circuits
are designated 100-1A and 100-1B.

The 100-1A chip is a 28-pin integrated circuit providing
audio-spectrum analysis over the range of intelligibility for
speech, 200 to 7000 Hz. The analog input to the 100-1A is $
volts rms maximum from a low-output impedance source.
The 100-1A consists of 16 bandpass filters, each followed by
a half-wave rectifier and a second-order low-pass filter with
25-Hz cutoff. The monolithic 100-1A utilizes NMOS
switched-capacitor technology with 80 operational amplifiers
to achieve the required audio-spectrum analysis.
Additionally, this chip contains a 16-channel analog multiplexer and
decoder which require timing signals from a single TTL 1-MHz
clock. The analog multiplexer is addressed via four TTL lines.
The analog output of the 100-1A chip is from a buffer
amplifier. This output is suitable for a 0- to 5-volt
user-supplied analog-to-digital converter.
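
What one analyzer channel does after its bandpass filter (half-wave rectification followed by low-pass smoothing) can be imitated digitally. The sketch below is only an approximation under stated assumptions: the chip is an analog switched-capacitor device, whereas this is a sampled model; the bandpass stage is omitted, so the input is assumed to be already band-limited; and the second-order low-pass is approximated by two cascaded one-pole sections, so the 25 Hz figure is nominal. Function names are illustrative.

```python
import math
from typing import List


def one_pole_lowpass(samples: List[float], cutoff_hz: float,
                     sample_rate_hz: float) -> List[float]:
    """Single-pole smoothing filter with the given nominal cutoff."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    out: List[float] = []
    y = 0.0
    for x in samples:
        y += alpha * (x - y)
        out.append(y)
    return out


def channel_envelope(band_signal: List[float], sample_rate_hz: float,
                     cutoff_hz: float = 25.0) -> List[float]:
    """Half-wave rectify one band's signal, then smooth it twice."""
    rectified = [max(x, 0.0) for x in band_signal]   # half-wave rectifier
    stage1 = one_pole_lowpass(rectified, cutoff_hz, sample_rate_hz)
    return one_pole_lowpass(stage1, cutoff_hz, sample_rate_hz)


if __name__ == "__main__":
    fs = 16_000
    # One second of a unit-amplitude 1 kHz tone standing in for one band.
    tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(fs)]
    print(round(channel_envelope(tone, fs)[-1], 3))
```

For a steady tone the output settles to a slowly varying level of roughly 0.3 times the peak amplitude, so each channel reduces a fast audio waveform to a value that changes only tens of times per second. That is what makes it practical to multiplex all 16 channels into a single analog-to-digital converter.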

The 100-1B chip is Interstate's 40-pin recognizer controller.
This chip contains the entire algorithm for recognition of
isolated speech utterances including (1) word boundary
detection, (2) amplitude normalization, (3) end point time
compression, and (4) programmable vocabulary syntax. The
100-1B provides parallel I/O and control of the analog multiplexer
and the analog-to-digital converter. Commands provided via
the parallel input port are interpreted by the 100-1B chip.
Recognition and command responses are provided via the
parallel output port. All data I/O is in the form of ASCII
characters.
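
Two of the steps listed above, amplitude normalization and time compression, are what turn an utterance of arbitrary loudness and length into a fixed-size pattern. The data sheet gives no details, so the sketch below is one plausible reading and not the chip's algorithm: normalization scales by the utterance's peak value, and time compression averages the frames between the detected end points into a fixed number of equal segments. The function names and the choice of peak scaling are assumptions.

```python
from typing import List

Frame = List[float]   # one spectrum-analyzer reading: one value per channel


def normalize_amplitude(frames: List[Frame]) -> List[Frame]:
    """Scale the utterance so its largest channel value is 1.0."""
    peak = max((v for f in frames for v in f), default=0.0)
    if peak == 0.0:
        return [list(f) for f in frames]
    return [[v / peak for v in f] for f in frames]


def compress_time(frames: List[Frame], target_frames: int) -> List[Frame]:
    """Average the utterance into `target_frames` equal-duration segments."""
    n = len(frames)
    if n == 0 or target_frames <= 0:
        raise ValueError("need at least one frame and a positive target")
    out: List[Frame] = []
    for seg in range(target_frames):
        lo = seg * n // target_frames
        # Guarantee a non-empty segment when the utterance is shorter
        # than the target, so short words are stretched rather than lost.
        hi = max((seg + 1) * n // target_frames, lo + 1)
        chunk = frames[lo:hi]
        out.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return out


if __name__ == "__main__":
    utterance = [[2.0, 0.0], [4.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
    print(compress_time(normalize_amplitude(utterance), target_frames=2))
```

After both steps a quietly spoken, drawn-out word and a loud, clipped version of the same word yield patterns of identical size and similar scale, which is what allows them to be compared against a single stored template.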

Efficient, Real-Time Performance

Speech input is analyzed by a 16-channel spectrum analyzer
and converted to a digital representation of the
characteristics of the spoken input. This digital data is then
converted to a fixed-size pattern that preserves the information
content of the spoken inputs while discarding redundant
features. During word training, these patterns are used to
derive templates for each vocabulary item. The templates are
then used in the recognition process for comparison with
incoming spoken words. Vocabulary templates are stored in
the external RAM while the processing algorithms are
contained within the speech analyzer device.

The ROM accommodates eleven user commands. These
include two training modes: (1) normal training, in which all or
part of the specified vocabulary is cleared and then trained a
selectable number of samples, and (2) updating of word
patterns, in which the stored reference patterns of the specified
vocabulary are augmented by additional training.

The VRC100-1 training algorithm automatically rejects
utterances during training that do not sufficiently agree with
the same utterance from previous training samples of the
word. This prevents significant alteration of a vocabulary
reference pattern caused by spurious noise (bumping the
microphone, door closure, coughing), speaking
inconsistencies, or simply failing to utter the prompted vocabulary item.
In such an event it may be necessary to repeat an utterance
before being prompted to the next sequential utterance.

Additional commands allow the VRC100-1 to upload or
download reference patterns via the selected I/O port. The
reset command is used to initialize the VRC100-1 chip set
RAM and to define I/O mode and format. The VRC100-1 chip
set also allows control of the rejection of invalid utterances
via the set reject and read reject threshold commands.

An additional command allows programmable control of the
analog-to-digital reference voltage and preamplifier gain.
Finally, the major operational command of the VRC100-1 is
the recognize command. This command allows recognition
of any specified vocabulary up to 100 words. The command
allows recognition of both contiguous and/or random syntax
with one or more common subvocabularies.

A Family of Voice Recognition Products

The Model VRC100-1 chip set and other Interstate-developed
chip sets benefit both OEMs and end users by
enabling the design flexibility to support a wide range of
applications. Interstate's family of speech recognition
products can be economically incorporated into a variety of
industrial systems and consumer products - from large-scale
inventory control equipment to personal computers and
hobby items.

MODEL  VRC100-1  SPECIFICATIONS 
Performance 

Vocabulary  Size:  Up  to  100  isolated  words  and/or  phrases 
Percent  Recognition  Accuracy:  99+  percent 
Reject  Threshold:  User-selectable. 

Longest Utterance Duration: 1.25 seconds
Minimum Between-Word Pauses (User Selectable): 160 milliseconds

Minimum Word Length (User Selectable): 00 milliseconds
Approximate Response Time: (50 + 2N) milliseconds,
where N = active vocabulary size, with a 4 MHz crystal.
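
As a worked example of the response-time specification, the sketch below evaluates the formula for a few vocabulary sizes. The constants are read from a partly illegible specification line as a fixed 50 ms plus 2 ms per active word; treat them as an assumption rather than a confirmed figure. The function name is hypothetical.

```python
def response_time_ms(active_vocabulary_size: int,
                     base_ms: int = 50, per_word_ms: int = 2) -> int:
    """Approximate response time: base_ms + per_word_ms * N milliseconds."""
    if active_vocabulary_size < 0:
        raise ValueError("vocabulary size cannot be negative")
    return base_ms + per_word_ms * active_vocabulary_size


for n in (10, 50, 100):
    print(n, "words:", response_time_ms(n), "ms")
```

Under this reading, the full 100-word vocabulary responds in roughly a quarter of a second, and restricting recognition to a smaller active subvocabulary shortens the response proportionally.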

Host  Commands 

1. Train
2. Update
3. Reset
4. Set reject threshold
5. Read reject threshold
6. Download
7. Upload
8. Set analog-to-digital preamplifier gain
9. Recognize
10. Write parameters
11. Read parameters

Input/Output 

Parallel TTL input/output: eight data input bits, eight data output
bits with four control lines. All data input/output is in the
form of ASCII characters.

Mechanical 

Speech Preprocessor Chip: Dual in-line 26-pin package.
Recognizer/Controller  Chip:  Dual  in-line  40-pin  package. 

Electrical 

Power Requirements: 100-1A: 10 V, -10 V at 30 mA;
100-1B: +5 Vdc at 240 mA


Voice Products Operations

1001 E. Ball Road, P.O. Box 3117, Anaheim, California 92803
Telephone 714/635-7210  TWX 910-591-1197  Telex 655443 & 655419
Call toll-free: in the continental U.S. 800/854-6979; in California 800/422-4580

plug-in board for use in the DEC VT100 terminals and similar
models. It basically transforms spoken words into ASCII
strings and transmits these strings to the host computer.
The module costs $1,295. It was announced in the July
19, 1982, issue of Computerworld.

Interstate  has  recently  expanded  their  product  lines  in 
the  areas  of  voice  input  and  output  devices  substantially. 
Their  whole  approach  typifies  the  increasing  emphasis 
manufacturers  are  now  placing  on  speech  products  in  general. 
The  consumer  is  finding  new  devices  on  the  market  faster 
than  ever  before,  and  this  trend  is  expected  to  continue  to 
accelerate. 

Nippon  Electric  Company 

Nippon Electric's speech recognition products may be
classified as being in the relatively higher-priced,
systems-level product category.

Nippon Electric's DP-100 speech recognizer gained wide
attention for its ability to recognize connected speech.
Their newest recognizer, the DP-200, is generally similar to
the DP-100 but at a reduced price.

The DP-200 uses dynamic programming to match templates
obtained during training with those from incoming speech
data. One very noteworthy point about the DP-200 is, again,
its ability to recognize connected speech. It also has a
larger vocabulary than the DP-100 (150 words vs. 120 words).
The DP-200 is approximately 1/1 the size of the DP-100. The
price of the DP-200 is approximately 20-10% less than that
for the DP-100, which would make it approximately $11,000.

There are several further comments to make regarding
the DP-200:

1) The DP-200 will recognize dialects.

2) Minimal training is required; one pass for most
words, two for numerics.

3) The DP-200 uses dynamic programming to "warp" time
frames of incoming speech to achieve best matching of
words in the shortest time possible.

4) Optional audio response is available.

5) The DP-200 has a wider range of interfacing
capabilities than the DP-100.
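
The time "warping" in item 3 can be illustrated with the textbook dynamic time warping recurrence. This is a generic sketch, not NEC's implementation (which is not described in this report); it uses one number per frame for brevity, where a real recognizer would compare multi-channel spectral frames. Function names are hypothetical.

```python
from typing import Dict, List


def dtw_distance(a: List[float], b: List[float]) -> float:
    """Cost of the best time alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of a
    # with the first j frames of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(utterance: List[float],
              templates: Dict[str, List[float]]) -> str:
    """Pick the template with the lowest warped distance to the utterance."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
```

The same word spoken slowly aligns to its template at zero or near-zero cost because frames may be repeated along the alignment path, which is why time normalization of this kind tolerates variation in speaking rate.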

A final point to mention regarding the DP-200 is that it is
a stand-alone system of a fairly compact nature, consisting
of a speech recognition terminal, a remote control terminal,
and a noise-cancelling microphone. Following is Table 12,
which contains a brief description of the DP-200's approach
to speech recognition, in addition to a block diagram of the
Nippon system.


Scott Instruments

Scott  Instruments  markets  terminal  or  systems  level 
speech  recognition  products.  Scott  Instruments’  voice  entry 
terminal  does  not  come  with  a  host  computer;  this  points  out 
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CSR: 

Compatible  Human  to  Machine  Communications 


With the NEC DP-200 CSR, technology
has at last created the ideal way of
dealing with machines, using and
controlling them. The results are
striking: data entry is as simple as
sitting down and speaking in a normal
conversational tone, while a computer
captures your words instantly. What
was once deemed technically impossible
is now a reality. With the DP-200
CSR, NEC's research team has not
only produced a quality data entry
system but also created the most direct
and efficacious way yet for man to use
and control machines.


Until now, machines could neither
recognize nor process speech patterns
that varied both in speed and in context.
Previous linear computers were
stumped, unable to recognize where
one word ended and the next one
began. The thousands of variations
contained in normal connected speech
were deemed too great a task to be
logically identified and stored by
a machine.

The NEC CSR uses Dynamic Programming
(DP), a time normalization method
which has effectively solved these once
insurmountable problems. DP allows


errors due to word segmentation and
incorrect matching. DP has provided
the technological leap forward
necessary to achieve direct and
compatible human-to-machine
communications.


The heart of the DP-200 lies in a series
of high-speed computations utilizing
Dynamic Programming techniques.
Incoming speech signals from the
operator's microphone (as analog
waveforms) pass through a spectrum
analyzer and are immediately converted
to a digital signal. The signal is
transferred and compared to the pre-


the fact that drawing the line between systems level
products and board level products is not always a clear-cut
distinction.

Scott Instruments markets one basic recognizer, the
VET-2 Voice Entry Terminal. The basic system consists of
the VET-2 preprocessor, software and demonstration programs,
operations manual, and a noise-cancelling microphone.

The VET-2 is available for Apple or TRS-80 computers.
It is noted that the unit interfaces with off-the-shelf
software, or programs may be written in BASIC,
INTEGER-BASIC, APPLESOFT-BASIC, or machine code.

The VET-2 has a 40-word basic vocabulary, with an
overlay feature to allow access to additional vocabulary
residing in disk storage. The VET-2 claims an accuracy rate
of 98%+. A five or six training pass approach is suggested.

Scott  Instruments'  approach  utilizes  an  acoustic 
preprocessor  to  analyze  the  acoustic  signal  within  a  range 
of  300-4000  Hz.  Analysis  consists  of  breaking  the  frequency 
range  down  into  2  regions  (300-1000  Hz.  and  1000-4000  Hz.), 
then  taking  zero-crossing  measures  in  both  regions  and 
extracting  the  amplitude  envelopes  of  the  two  regions.  The 
four  resulting  analog  data  lines  are  converted  to  digital 
form  at  the  request  of  the  host  computer. 
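
The two measures taken in each band (zero-crossing counts and amplitude envelope) can be sketched as follows. The band-splitting filters of the preprocessor are omitted: the sketch assumes each band has already been isolated and uses pure test tones to stand in for the two bands. The sample rate, frame length, and function names are assumptions for illustration only.

```python
import math
from typing import List, Tuple


def zero_crossings(samples: List[float]) -> int:
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))


def envelope(samples: List[float]) -> float:
    """Mean rectified amplitude, a simple stand-in for an envelope detector."""
    return sum(abs(s) for s in samples) / len(samples)


def band_measures(low_band: List[float],
                  high_band: List[float]) -> Tuple[int, float, int, float]:
    """Return the four values: (zc_low, env_low, zc_high, env_high)."""
    return (zero_crossings(low_band), envelope(low_band),
            zero_crossings(high_band), envelope(high_band))


def tone(freq_hz: float, seconds: float = 0.1, rate: int = 8000,
         phase: float = 0.1) -> List[float]:
    """Unit-amplitude sine wave used here as a stand-in for one band."""
    count = int(seconds * rate)
    return [math.sin(2 * math.pi * freq_hz * n / rate + phase)
            for n in range(count)]


zc_low, env_low, zc_high, env_high = band_measures(tone(500.0), tone(2000.0))
```

A sine wave crosses zero twice per cycle, so the zero-crossing count over a fixed frame tracks the dominant frequency in that band, while the envelope tracks loudness. Together the four values give a compact description of each frame at very low hardware cost.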

Words for the VET-2 can be up to 1.5 sec. in duration,
and up to 20 characters long. The template area for a
40-word vocabulary requires approximately 4600 bytes of
storage. Control software requires approximately 6000
bytes, for a total of 10.6K memory required in the host
computer. Following is Table 13 with operation
specifications and key features of the VET-2, which retails
for a relatively inexpensive $795. One of the most difficult
areas in which to compare recognizers is price, for the
capabilities vary as a function of cost. Nonetheless, cost is
a major consideration to most buyers.

Threshold Technology

Threshold  Technology’s  speech  recognition  products  are 
generally  geared  toward  the  high  end  of  the  market;  their 
speech  recognizers  are  characterized  as  systems  level 
products. 

Threshold  Technology  has  just  started  a  new  subsidiary, 
called  Auricle.  Together,  these  two  groups  market  three 
basic  lines  of  speech  recognizers.  The  two  Threshold 

recognizers,  the  580  and  600  units,  are  among  a  select  few 
recognizers  that  will  accept  connected  speech  input. 
Threshold's  approach  uses  dynamic  programming,  where  words, 
rather  than  combinations  of  phonemes,  are  recognized  as 
units.  Threshold's  Votrax  unit  uses  a  16-channel  bandpass 
filter  with  a  rectified  compressor  with  proprietary 
circuitry  and  a  commercial  codec. 

Threshold emphasizes that its units will accept
connected words with very short interword pauses. Threshold
calls this feature "Quiktalk". In fact, Threshold notes
TABLE 13


Scott Instruments


KEY  FEATURES 


Available  on  the  APPLE  or  TRS-80 

Easy to use: interfaces with off-the-shelf
software, or programs may
be written in BASIC, INTEGER-BASIC,
APPLESOFT-BASIC or MACHINE CODE

KEYVET  feature  allows  voice  to  be 
used  in  conjunction  with  the  key¬ 
board  (Apple  only) 

Multiple  user  capabilities  with  no 
increase  in  storage  requirements 

40-word  vocabulary  with  overlay 
feature  to  access  additional  vo¬ 
cabularies  from  disk 

High  accuracy  (98%+) 


SPECIFICATIONS 


Requires  an  Apple  II  or  Apple  II- 
plus,  48K  machine  with  at  least  one 
disk  drive,  or  a  TRS-80  Model  I 
with  32K  or  48K  and  two  disk  drives 

SIZE:  Approximately  IV  H  x8"  W 
x  11"  D 

WEIGHT:  Approximately  5  lbs 

POWER: Apple power supply, or
TRS-80: 115 VAC, 60 Hz, 15 Watts
that it will accept words faster than normally encountered
in continuous speech (180 wpm). It is noted that this
feature permits data entry much faster than via keyboard
entry, with a claimed accuracy of 99%+.

Prices for units with the Quiktalk feature should be,
Threshold states, 10-20% above those for the older 500,
600, and VIP-100 units, which can be upgraded to include
this feature. Thus, the 580 unit sells for approximately
$16,000; the 680 unit sells for approximately $11,300.

The Threshold 580 recognizer has a 60-word or phrase
vocabulary (expandable to 340). It includes two
noise-cancelling microphones. It produces ASCII coded output
and has a 16-character alphanumeric display for voice data
entry and verification. Also included are ready and reject
indicators. The unit has a reject decision level
(externally set by program control). The 580 accepts
words/phrases up to two seconds in duration.

The Threshold 680 recognizer features a 50-word or
phrase vocabulary, and also produces ASCII coded output. It
permits local tape cartridge storage of user speech
patterns, training prompts, and output messages. The 680
includes a CRT display terminal for operator prompting,
editing, and verification. The unit is current loop output
compatible from 50 baud to 19.2K baud. The 680 has optional
wireless radio input. It accepts words or phrases up to two
seconds in duration, like the 580. Finally, for a quick
comparison of features of the 580/680 recognizers, Table 14
gives a short listing of comparative specifications.

Threshold 580/680 units appear to utilize proprietary
filter algorithms which statistically match digitized input
speech data with stored templates. Threshold states that
its Quiktalk feature consists of recognizing strings of
words as units, rather than recognizing individual words.
This permits detection of the shortest possible pauses
between words. Thus both of these units nearly attain the
long-sought goal of continuous speech recognition.

Further specifications are given in Tables 15 and 16
for the Threshold 580 and 680 recognizers.

The Auricle-I is designed to be a lower cost
recognizer, with a vocabulary of 80 words or short phrases.
It includes LSI circuitry, boasts an accuracy rate over 99%,
and includes a settable reject level. One purpose of the
Auricle-I is to function as a benchtop development system
which will help familiarize designers with speech
recognition and help them decide if such an approach is
suitable for their end products. The Auricle-I costs
$2,500. Also under development is a board-level product, the
Auricle-II, which is a speech recognizer card.

The Auricle-I uses a 16-channel bandpass filter and a
rectifier/compressor that consists of proprietary circuitry
and a commercial codec. The Auricle's host Z-80 correlates
input templates with stored templates to determine matches.
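
Correlating an input template against stored templates, combined with a settable reject level, can be sketched as below. The Auricle's actual correlation measure is not published in this report; the Pearson-style normalized correlation, the `reject_below` threshold value, and the function names are assumptions used only to illustrate the idea.

```python
import math
from typing import Dict, List, Optional


def correlation(a: List[float], b: List[float]) -> float:
    """Normalized correlation in [-1, 1] between two equal-length templates."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def best_match(pattern: List[float], templates: Dict[str, List[float]],
               reject_below: float = 0.8) -> Optional[str]:
    """Return the best-correlated word, or None if no score clears the reject level."""
    best_word, best_score = None, -1.0
    for word, template in templates.items():
        score = correlation(pattern, template)
        if score > best_score:
            best_word, best_score = word, score
    return best_word if best_score >= reject_below else None
```

Raising the reject level trades missed words for fewer false acceptances. That trade-off is the practical meaning of a "settable reject level" for an application such as keyword monitoring, where a false alarm and a missed word carry different costs.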
TABLE 14


Threshold Technology 580/680 Recognizers


-- COMPARATIVE FEATURES OF BOTH SYSTEMS ARE SHOWN BELOW --


                      THRESHOLD 680            THRESHOLD 580

Operating speeds      180 words/minute         180 words/minute
(> 99% accuracy)

Vocabulary            40 words expandable      60 words expandable
                      to 250 words             to 340 words

Display               CRT                      16-character
                                               alphanumeric

Speaker Training      Stored in local          Stored in host
Data                  tape cassette            computer

Output Code for       User programmable        Unique ASCII code
Each Utterance        character or string
                      of characters; stored
                      on local tape
                      cassette

Vocabulary Display    User programmable;       Controlled by host
Message for           stored on local tape     computer
Operator Prompting    cassette

Electrical            Standard RS232C or       Standard RS232C or
Interface             current loop, serial     current loop, serial
                      asynchronous ASCII       asynchronous ASCII

Software Compat-      Fully compatible and     Requires special host
ibility with          no special software      computer software to
Standard Teletype     required                 handle communications
Terminal                                       protocol

(Rev. 3-14-80)
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Specifications 

Power Requirements
110/220 VAC single phase; 50/60 Hz; 125 Watts
(with standard features)

Operating Temperature
10 to 40°C (50 to 104°F)

Non-operating Temperature
-40 to 66°C (-40 to 150°F)

Humidity
10 to 90%, non-condensing

Dimensions, in (cm)
Processor
17.75 x 5.25 x 26.00 (45.0 x 13.3 x 66.0)

Display
11.00 x 4.75 x 13.75 (27.9 x 12.0 x 34.9)

Local Operator Console
10.00 x 5.00 x 4.00 (25.4 x 12.7 x 10.2)

Weight, lb (kg)
50 (23)

Specifications subject to change without notice.


mands  or  receiving  messages  or  requests  for 
input  verification. 

The Threshold 500/580 terminals feature
Threshold's exclusive QUIKTALK™ - the
closest yet to connected-word or continuous
speech recognition. QUIKTALK more than
doubles the rate at which operators may
communicate with their computers by
permitting pauses between words to be shorter
than required with ordinary isolated word
recognition systems. At an entry rate of 180
words per minute, operators have
consistently achieved better than 99% accuracy.

Model 500 typically provides data entry
rates up to 120 words or phrases per minute.
Where higher processing speeds are
required, Model 580 offers a typical input rate
of 180 words or phrases per minute.

Features

- Fully interactive communication

- Hands-free operation

- 60 word or phrase vocabulary, optionally
expandable to 340 words or phrases

- Two lightweight, noise-cancelling,
headband microphones

- ASCII coded output

- EIA-RS232-C, CCITT-V24 or 20mA current
loop teleprinter output compatible
from 50 baud to 19.2K baud

- 16-character alphanumeric display for
voice data entry and verification

- READY and REJECT indicators to show
operator when the system is ready to
receive speech and when it does not
understand the input speech (REJECT
indicator optionally audible)

- Reject decision level can be externally set
by program control

- Structuring (vocabulary subset
selection) can be externally set by
program control

- Remote voice input control

- Training mode and speaker identification
selector

- RAM semiconductor memory optionally
expandable to 340 word vocabulary

- Optional wireless radio input

- Optional rack-mount or desk top
configuration

- Accepts words or phrases up to two
seconds in length

Warranty

All Threshold Voice Data Entry Systems
carry a 90-day warranty for parts and labor.


the data entry company that has people talking

1829 Underwood Blvd., Delran, New Jersey 08075
(609) 461-9200


Covered by patents in the U.S.A. and foreign countries


Printed in U.S.A.
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Specifications 


Power Requirements          110/220 VAC single phase; 50/60 Hz; 125 Watts
                            (with standard features)

Operating Temperature       10 to 40°C (50 to 104°F)

Non-operating Temperature   -40 to 66°C (-40 to 150°F)

Humidity                    10 to 90%, non-condensing

Dimensions, in (cm)

Processor     17.75 x  5.25 x 26.00  (45.0 x 13.3 x 66.0)
Display       15.00 x 14.00 x 13.50  (38.1 x 35.6 x 34.6)
Keyboard      17.00 x  2.75 x  7.50  (43.2 x 68.0 x 18.7)
Tape Unit      8.25 x 12.75 x 16.25  (20.9 x 32.4 x 41.3)

Weight, lb (kg)             62 (28)

Specifications subject to change without notice.


interactive, operators can enter into true two-way
communication with their computer,
whether entering data, giving spoken
commands or receiving prompts or requests for
input verification.

Threshold 600/680 terminals feature
Threshold's exclusive QUIKTALK™ - the
closest yet to connected-word or continuous
speech recognition. QUIKTALK more than
doubles the rate at which operators may
communicate with their computers by
permitting pauses between words to be shorter
than those required with ordinary isolated
word recognition systems. At an entry rate
of 180 words per minute, operators have
consistently achieved better than 99%
accuracy.

Model 600 typically provides data entry
rates up to 120 words or phrases per minute.
Where higher processing speeds are
required, Model 680 offers a typical input rate
of 180 words or phrases per minute.

Features

- Fully interactive communication

- Hands-free operation

- User-programmable vocabulary selection

- Local editing and control

- 40 word or phrase vocabulary, optionally
expandable to 250 words or phrases

- Local tape cartridge storage of user
speech patterns, training prompts and
output messages

- Two cartridge tapes

- CRT display terminal for operator
prompting, editing and verification

- Two lightweight, noise-cancelling,
headband microphones

- ASCII coded output

- EIA-RS232-C, CCITT-V24 or 20mA current
loop teleprinter output compatible
from 50 baud to 19.2K baud

- Host processor vocabulary subset
selection and control

- Optional wireless radio input

- Optional rack-mount or desk top
configuration

- Accepts words or phrases up to two
seconds in length

Warranty

All Threshold Voice Data Entry Systems
carry a 90-day warranty for parts and labor.


the data entry company that has people talking

1829 Underwood Blvd., Delran, New Jersey 08075
(609) 461-9200


Covered by patents in the U.S.A. and foreign countries


Printed in U.S.A.
The Auricle-I has a 40-word vocabulary; this vocabulary can
be enlarged by using a host computer's memory to store
templates.

The Auricle-I requires three-pass training for
vocabulary items. The unit limits vocabulary to a very low
number of highly differentiated responses, so it is possible
to program in templates with a wide variety of
pronunciations. Response time for the Auricle-I is listed
at 350 msec. for a word of less than 1.2 seconds in
duration. It is noted that the Auricle-I is a completely
stand-alone system with its own power supply and
noise-cancelling microphone. The unit sells for $2,480.
Further information on the Auricle-I is given in Table 17.

Verbex

Verbex  speech  recognition  products  fall  into  the  high 
end  systems  level  category. 

Verbex  is  one  of  the  oldest  manufacturers  of  speech 
recognizers,  and  has  been  a  pioneer  in  the  area  of 
speaker-independent  speech  recognition  technology. 

Currently, Verbex markets two speech recognizers. The
Verbex Model 1800 is an isolated word, speaker-independent
speech recognizer (multi-user). The 1800 comes with a
recognition vocabulary consisting of the 10 digits, "zero"
through "nine", plus "yes/no". This vocabulary can be
expanded to 50 words. The 1800 includes voice response (32
TABLE 17

Auricle-I Recognition System


Preliminary Specifications


Features

• Self contained:

Auricle-I comes complete with power supply, noise-cancelling
microphone and all necessary connectors.

• Easy to interface:

Auricle-I delivers serial ASCII code to RS-232-C interfaces
through a DB25 connector.

• Easy to train:

To enter a word into Auricle-I's vocabulary, the user need
only say it three times.

• Easy to use:

Auricle-I's front panel has large controls and indicators
that are visible and accessible from a wide angle.

• Large vocabulary:

80 words or short phrases.

• High freedom from error:

Advanced LSI circuitry makes the Auricle-I more than 99%
accurate.

• Settable rejection level:

The user can define the decision threshold at which
Auricle-I differentiates similar words.

• Easy to develop:

Auricle-I has an internal 'monitor' program that provides
the user with a simple method to evaluate different
applications and vocabularies.

• Optional IEEE-488 Bus interface


Specifications

Electrical:

Supply requirements: 115VAC/60Hz (or 230VAC/50Hz)
Power consumption: 9 Watts
Microphone input impedance: 510 ohms
Output: RS-232-C compatible serial ASCII code;
baud rate selectable from 300 baud to 19.2 kilobaud

Speech:

Vocabulary size: 80 words, expandable
Maximum utterance: 1.2 seconds duration
Response time: less than 300 ms
Accuracy: 99%

Environmental:

Operating temperature range: 0-50°C
Relative humidity: 10%-90%, non-condensing

Dimensions:

Height: 3 inches
Width: 12 inches
Depth: 13 inches
Weight: 4 lbs

Warranty:

Against defects in material and workmanship for 90 days

Prices:

Please contact Auricle or authorized representative for
price and delivery information.


Auricle, Inc., A Subsidiary of Threshold Technology Inc.
20823 Stevens Creek Blvd., Cupertino, CA 95014
(408) 257-9830
words or 16 seconds of speech); this can be expanded up to
512 words or 256 seconds of audio response. Both the
recognition and response vocabularies can be customized as
required for individual applications. The 1800 can also be
used over the telephone. Table 18 indicates the Verbex Model
1800's basic specifications.

The second speech recognizer marketed by Verbex is the
Verbex Model 1800-CSRS. This unit is speaker-dependent and
handles continuous speech input. The unit also has
single-channel entry microphone input; it is basically
designed for high accuracy digit entry, plus 10 isolated
command words. We wish to point out the further possibility
of using the Verbex Model 1800 recognizer to spot keywords
in incoming Coast Guard radio transmissions. This unit is
designed to accept isolated words as input, but this may be
sufficient for the Coast Guard's needs. That is, we suspect
that keywords, such as "mayday," may actually be pronounced
slowly enough for the Verbex unit to recognize these with
high accuracy. One unknown in this area concerns how the
Verbex unit might generate "false alarms" for connected
speech input, from which relatively "isolated" keywords
would have to be separated by the recognition unit.

Voicetek 

Voicetek's line of speech recognition products
may be categorized as board level. With their relatively
inexpensive board level recognizers, Voicetek is seeking the
home computer, hobbyist market.

Voicetek  markets  a  line  of  inexpensive  voice 
input/output  devices  for  small  computer  systems.  Voicetek 
notes  that  they  have  been  able  to  achieve  inexpensive  prices 
for  their  voice  I/O  devices  due  to  their  having  successfully 
compressed  required  electronics  onto  a  single  integrated 
circuit  chip.  Following  is  a  list  of  major  features 
regarding  Voicetek's  voice  I/O  devices,  which  are  called 
Cognivox  units: 

1) Unlike speech recognizers that employ frequency
domain (filter bank) analysis, Cognivox units operate
on the time-domain signal. This allows for high
performance at low cost. Cognivox units also use a
new and exclusive nonlinear pattern matching
algorithm to enhance performance. Voicetek
technology does involve the use of Fast Fourier
Transforms (FFT), but details are not available in
this area.

2) Voicetek units have been given a 50 hour burn-in or
testing period.

3) A Cognivox unit is priced lower than either a
comparable speech recognizer or a voice-response
unit, yet it combines both features.

4) A Cognivox unit features easy training, with the user
repeating the desired vocabulary three times at the
prompting of the host computer.

Voicetek's speech products are basically divided into three
lines:


1) VIO-1000 series of voice I/O peripherals. These are
priced at $249 and are for Rockwell AIM-65, PET/CBM
16K or 32K, or Apple II computers. This is the
top-of-the-line Voicetek unit.

2) VIO-XXX series of voice I/O peripherals. These are
priced at $149 and are for economical voice I/O
applications that do not require high fidelity speech
output. The VIO-XXX is suitable for Exidy's
Sorcerer, Z-80 based systems, TRS-80 Level II 16K, and
PET/CBM 16K or 32K computers.

3) SR-100A and SR-100P units. These are speech
recognition peripherals for the AIM-65 (4K) as well
as the PET/CBM (8K, 16K, and 32K) computers.

Finally, note that Voicetek software is written in BASIC on
cassette. It has filtering routines, including FFT. The
following Table 19 summarizes major Voicetek features.


Votan

Votan  manufactures  both  systems  and  board  level 
recognition  products. 

Votan has a very interesting approach to voice I/O,
which consists of a reversible algorithm that works for both
voice input and voice output.* Their top-of-the-line model


* Voice output is an imminent enhancement.
TABLE 19


VOICETEK COGNIVOX

1.  Speech  input  and  output  combined  in  one  unit. 

COGNIVOX  is  the  only  unit  in  the  market  that  allows  both  speech  input 
and  output.  Our  experience  with  COGNIVOX  and  customer  feedback 
indicates  that  this  is  the  way  to  go  in  speech  peripherals. 

2.  Extensive  applications  software  and  support. 

This  is  the  area  of  crucial  importance  since,  for  most  users,  the 
utility  of  a  given  system  is  directly  proportional  to  the  available 
applications  software.  Recognizing  this  reality,  VOICETEK,  has  a 
strong  commitment  to  seek  out  and  develop  applications  for  speech  I/O. 

We  currently  offer  two  sophisticated  speech-operated  video  games,  as 
well  as  utilities  such  as  a  talking  calculator,  vocal  memory  dump,  etc. 
We  are  also  working  on  a  series  of  application  articles  to  be  published 
in  the  major  microcomputing  magazines,  such  as  BYTE,  Kilobaud  and 
Creative  Computing.  Applications  include  speech-operated  instruments, 
voice-controlled machines and toys, talking appliances that respond to
spoken  commands,  and  so  on. 

3.  State-of-the-art  design. 

Unlike other speech recognizers that employ frequency domain analysis,
COGNIVOX operates on the time-domain signal. This novel and unique
approach allows for high performance at low cost. In addition, COGNIVOX
employs a new and exclusive non-linear pattern matching algorithm
that significantly enhances its performance.

4.  Quality  hardware. 

The COGNIVOX hardware is carefully designed and assembled. It is
tested after assembly and again after a 50-hour burn-in period to ensure
long and trouble-free life. The COGNIVOX hardware is enclosed in a
beautiful injection-molded instrument case, giving it an elegant
appearance.

5.  Affordable  price. 

COGNIVOX  is  priced  lower  than  either  a  comparable  speech  recognizer  or 
a  voice-response  unit,  yet  it  combines  both  features.  The  low  price  is 
made  possible  by  innovative  design  and  by  our  conviction  that  voice  I/O 
must  be  priced  right  before  it  gains  the  wide  appeal  it  deserves. 

6. Easy Training.

Today's  technology  allows  only  speaker-dependent  recognizers,  meaning 
that  the  recognizer  must  be  trained  to  the  voice  of  the  individual  user. 
In  the  case  of  COGNIVOX,  this  training  is  very  easy  and  can  be  done 
quickly,  as  the  user  must  repeat  the  vocabulary  three  times  at  the 
prompting  of  the  computer.  Training  the  voice  response  portion  is  also 
very  simple,  requiring  that  the  user  pronounce  the  voice  response 
vocabulary  only  once. 
is the Votan V1000. The V1000 is a stand-alone development
system designed to enable product planners to evaluate the
use of Votan's speech technology in proposed or existing
products or systems. This is an increasingly popular
approach with voice I/O device manufacturers in general.
The V1000 sells for $5,000. Votan also markets the V2000,
which is an industrial control module, with training and
display functions carried out by the host software. It has
a list price of $4,400. Finally, Votan markets the V1000
which is an O.E.M. circuit board, at a cost of $1,000.

The V1000 is a stand-alone device. It will accept
words or phrases up to two seconds maximum duration. It has
a capacity of up to 100 seconds word storage (approximately
160 words single-trained or 80 words double-trained). The
V1000 will operate under very high noise conditions (up to
85 dB of background noise).

Votan uses an analog-to-digital converter to transform
incoming speech data into a digital representation. A
proprietary algorithm then processes the speech signal into
its frequency components. Following the spectral
transformation, dynamic programming warps spectral templates
for comparison with reference templates. The spectral
processing algorithm is reversible. This means that it
should be able to accommodate speech synthesis as well as
speech recognition.
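
The dynamic-programming comparison of templates described above can be
sketched roughly as follows. This is a generic dynamic time warping
routine, not Votan's proprietary algorithm; the function names, the
Euclidean frame distance, and the toy one-value "spectral" frames are
all assumptions made for illustration.

```python
def dtw_distance(template, utterance):
    """Minimum cumulative frame distance between two frame sequences.

    Each sequence is a list of frames; each frame is a list of numbers
    (for example, energies in a few frequency bands).
    """
    def frame_dist(a, b):
        # Euclidean distance between two frames (an assumed metric).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    n, m = len(template), len(utterance)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i template frames
    # with the first j utterance frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(template[i - 1], utterance[j - 1])
            # Allow a frame to be stretched, compressed, or matched 1:1.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, references):
    """Return the vocabulary word whose reference template warps best."""
    return min(references,
               key=lambda word: dtw_distance(references[word], utterance))


# Hypothetical two-word vocabulary with one-value frames.
references = {"yes": [[1.0], [2.0], [3.0]], "no": [[3.0], [1.0], [0.0]]}
slow_yes = [[1.0], [1.0], [2.0], [3.0], [3.0]]  # same word, spoken slower
print(recognize(slow_yes, references))
```

The warping step is what lets a slowly spoken word still match a
template that was trained at a different speaking rate.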
Votan's synthesis chip provides user programmability
which is lacking in other LPC-based synthesis chips. Thus
Votan's chip should be user-trainable in the field for easy
accommodation of new vocabulary for synthesis. This
approach should also allow new flexibility in speech
selection for synthesis; previous LPC synthesizers have had
to be reprogrammed in the laboratory. Votan promises
significant future enhancements to their system. Following
is Table 20, listing these enhancements, plus the general
operating specifications of the V1000.
CHAPTER 5


OVERVIEW  OF  SPEECH  SYNTHESIS  PRODUCTS  AND  TECHNOLOGY 

This chapter contains a brief review of currently
available speech synthesis technology, plus a statement of
Coast Guard operational requirements in this area. The three
categories of subject matter discussed below are: 1) price
ranges of speech synthesizers, 2) different product levels,
and 3) Coast Guard operational requirements related to
speech synthesis technology.

1) Price ranges of speech synthesizers. Speech
synthesis products vary widely in price. For example, we
note that Centigram's complete voice development system
(model 6700) costs approximately $29,500. On the other end
of the scale, we note various board and chip level products
costing several hundred dollars, or less. Obviously, speech
synthesizers vary widely in performance as a function of
price. This report details the various advantages and
disadvantages of each of the speech synthesis products
reviewed, so that price value can be determined for any
systems of potential interest to the Coast Guard.

2) Different product levels. This report notes that
speech synthesis products are in three basic categories: LSI
chip-level products, printed circuit board products, and
complete synthesis systems.
LSI chip level products generally come with no control
software and must be integrated into circuit boards before
they can be used. This report cautions against using such
products without being fully aware of the engineering and
developmental costs which accompany such speech
synthesizers.

Printed circuit board products are designed to plug into
the interface units of existing host computers (RS232-C or
parallel interfaces). Board-level synthesizers are
generally easy to integrate into existing host computers.
Elaborate software is generally not required, as
synthesizers of this type generally operate under ASCII
input from a host terminal.

Complete speech synthesis systems are the most
functional. For example, the Centigram 6700 Voiceware
Development System is in this category. It comes complete
with a host computer, CRT terminal, digitizer, and
Centigram's Lisa synthesizer. This type of system requires
nothing more than being plugged into a wall socket for
operation. Such a system is very easy to operate, though it
tends to be relatively expensive.

3) Coast Guard operational requirements related to
speech synthesis technology. Overall the operational
requirements for Coast Guard broadcasts are for a
synthesizer with an essentially unlimited vocabulary. This
would allow for broadcasting vessel names, geographical
locations, and detailed information. One application of
speech synthesis technology that could be done with a
relatively small vocabulary would be synthesizing weather
broadcasts, which can be generated instantaneously from the
teletype messages. Speech synthesis technology would also
offer the option of being able to synthesize either female
or male voices. For operation by Coast Guard
personnel, a text-to-speech voice synthesizer would be a
very logical method for meeting Coast Guard requirements for
speech synthesis. Such an approach merely requires that the
user type in the desired text to a terminal, with
desired voice output being automatically derived from the
text-to-speech unit. Several such text-to-speech units were
reviewed. For example, we suggest the two Votrax
text-to-speech units, plus Telesensory's latest
text-to-speech prototype unit. We had the opportunity to
evaluate Telesensory's text-to-speech unit over the
telephone. Anyone wishing to hear a demonstration of this
unit should call (415) c69-t2<5'7.


5.1 AVAILABLE SPEECH SYNTHESIZERS

There were a number of companies which SCRL was able to
locate that manufacture voice output devices (speech
synthesizers) of various types and configurations. First,
there are the analysis synthesis synthesizers, which have
stored coefficient values obtained from real speech.
Second, there are the rule synthesizers, which model speech
upon various parameters, combinations of which are used for
actual synthesis. Finally, there are the synthesizers which
rely upon direct digitization and playback of real speech.

For readers unfamiliar with linear prediction coding or
LPC analysis, reference is made to Markel, Gray, and Wakita
(1973). In this SCRL Monograph, the authors detail LPC
analysis, which involves predicting data from past data
samples. Such an approach is intimately related to multiple
regression and to setting up digital filter coefficients
which allow for an economical (in terms of bit rate)
representation of speech data. LPC analysis is commonly
used in analysis synthesis.
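
The idea of predicting each sample from past samples can be
illustrated with a minimal sketch. It uses the autocorrelation method
with the Levinson-Durbin recursion, a standard way of solving for the
predictor coefficients; it is not taken from the monograph, and the
function names and the toy test signal are assumptions.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(signal)
    return [sum(signal[t] * signal[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def lpc(signal, order):
    """Return (coeffs, err) where s[t] is predicted as
    coeffs[0]*s[t-1] + coeffs[1]*s[t-2] + ... and err is the
    remaining prediction-error energy."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)          # a[k] multiplies s[t-k]; a[0] unused
    err = r[0]
    for i in range(1, order + 1):
        if err == 0:                 # silent frame: nothing to predict
            break
        # Reflection coefficient for this stage of the recursion.
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a[1:], err


# Toy signal: each sample is exactly 0.9 times the previous one, so a
# good predictor should weight the previous sample by about 0.9.
signal = [0.9 ** t for t in range(200)]
coeffs, err = lpc(signal, 2)
print(coeffs, err)
```

The economy in bit rate comes from storing a handful of coefficients
per frame in place of every sample in that frame.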

As with speech recognition products, speech synthesis
products may be broadly divided into three general product
classifications: systems level products, board level
products, and chip level products. The systems level
products are generally characterized by including a host
computer and consisting of stand alone terminals which
synthesize speech.

SCRL received information from the following twelve
manufacturers of speech synthesizers: 1) Centigram, 2) General
Instruments, 3) Interstate Electronics, 4) Kurzweil Computer
Products, 5) Maryland Computer Services, 6) Mimic, 7) MSC,
8) National Semiconductor, 9) Percom Data Co., 10) Telesensory
Speech Systems, 11) Texas Instruments, and 12) Votrax.
Centigram's Lisa synthesizer is in the category of
board-level products, as it is designed to connect to an
existing computer interface.

Centigram has recently introduced their Lisa speech
synthesizer which uses parametric waveform coding intervals
of 50 msec. This unit features a low bit rate and connects
to an RS232-C interface. The Lisa has memory storage for
30-120 seconds of stored speech data.

Centigram notes that their parametric waveform coding
allows the user to reprogram the unit in the field, so that
new utterances may be immediately stored for playback. This
circumvents the problem encountered by most available
synthesizers, which require reprogramming for synthesis of
additional speech data at the manufacturer's base of
operation. Table 21 indicates the specifications for the
Lisa synthesizer, which sells for $3,450.

The  price  is  very  reasonable  considering  the  fact  that 
the  unit  may  be  reprogrammed  in  the  field  to  synthesize  any 
desired  utterance.  Centigram  emphasizes  the  high-quality 
voice  output  of  the  synthesizer. 

Centigram also markets a large voiceware development
system for $29,500 (model 6700). This system includes a
digitizer, Lisa synthesizer, microcomputer, disk, floppy
disk, and required software.
TABLE 21


Centigram's Lisa Parametric Waveform Coding Synthesizer


General Instruments' speech synthesis products are
geared toward the chip level market. Again, this product
classification involves sales to original equipment
manufacturers who wish to use speech synthesis in existing
products under development. General Instruments' basic
approach to speech synthesis supports phoneme synthesis,
even though it is an analysis synthesis approach.

General Instruments' basic LSI chip synthesizer is
designated the SP0250. This chip contains circuitry for a
6-stage, cascaded 12-pole programmable filter designed to
emulate the human vocal tract. The unit features simple
interfacing with any 8-bit microcomputer and a standard ROM
to form a complete speech system. The SP0250 chip is also
used in General Instruments' stand alone speech
synthesizers. First, the SP0250 is available with a 16K ROM
and controller technology on a single chip, as the SP0256.
Or, it is available with a 12K ROM and controller technology
on a single chip as the SP0212 (for future release). Table
22 describes General Instruments' approach to speech
synthesis, plus their speech processors and speech ROMs.

General Instruments markets several speech interface
chips, plus a complete speech synthesis module. Their
speech synthesis module VSM2032 combines the SP0250
synthesizer chip, a PIC1650A microcomputer (for formatting
SPEECH PROCESSORS

The SP0256, a stand-alone speech processor, is
capable of producing 8 to 20 seconds of natural
speech, or 40 seconds of robotic speech from its
internal ROM. Using external ROMs, the chip can
be expanded to address up to 491K bits of memory
directly: up to 810 seconds of natural speech,
and up to 3388 sequences of words or phrases.

It's easy to expand the vocabulary of the SP0256:
you can choose one or more of our serial speech
ROMs (SPR016, SPR032 or SPR128), or use the
SPR000 to interface with other standard memories.
The SP0256 can also be easily interfaced to
microcomputer/microprocessor based systems, directly
or through FIFO chips (SP8512 or SPB640). Applications
cover the entire spectrum from low-cost
high-volume single chip products to high-quality
low-volume products, in all market segments.

Refer to the table for other design options available
from General Instrument including speech synthesizers,
and an off-the-shelf module ready to talk with
the addition of a power source and speaker.
speech data), and an RO-3-93 'll ROM (for storing speech
data). The unit is all contained on one printed circuit
board and is designed to function as a speech synthesis
evaluation module, with built-in filter, amplifier, on-board
calculator, and clock vocabulary of 32 words and syllables
which can be concatenated. The vocabulary can be modified
by using custom ROMs or EPROMs. Table 23 lists the
specifications for General Instruments' speech interface
chips and speech synthesis module.

Interstate Electronics

Interstate's speech synthesis products are in the board
level product classification.

In  the  preceding  chapter  of  this  report,  it  was  noted 
that  Interstate  is  a  major  manufacturer  of  speech 
recognition  products.  In  this  section  of  the  report,  we 
will  only  mention  the  characteristics  of  their  VTM150  voice 
response  module. 

The  VTM150  is  a  single  printed-circuit  board  capable  of 
phoneme  synthesis  (rule  synthesis)  of  isolated  or  connected 
speech.  The  board  provides  a  standard  fixed  vocabulary  of 
approximately  500  words,  and  a  user-programmable  vocabulary 
of approximately 1000 words. The following two tables, 24
and  25,  describe  specifications  of  the  Interstate  VTM150 
voice  response  module. 
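
Table 24 states that each of the 64 phonemes, at one of 4 inflection
levels, is sent from the host to the VTM150 as two ASCII characters.
The brochure does not give the actual character mapping, so the
sketch below is purely hypothetical: it packs the 64 x 4 = 256
combinations into one byte and sends that byte as two hexadecimal
characters. The function names and the packing are assumptions, not
Interstate's protocol.

```python
PHONEME_COUNT = 64      # from the brochure
INFLECTION_LEVELS = 4   # from the brochure


def encode_phoneme(phoneme_index, inflection):
    """Encode one phoneme/inflection pair as two ASCII characters.

    Hypothetical packing: inflection in the top 2 bits, phoneme index
    in the low 6 bits, written as two hexadecimal digits.
    """
    if not 0 <= phoneme_index < PHONEME_COUNT:
        raise ValueError("phoneme index out of range")
    if not 0 <= inflection < INFLECTION_LEVELS:
        raise ValueError("inflection level out of range")
    return format((inflection << 6) | phoneme_index, "02X")


def encode_word(pairs):
    """Encode a word given as a list of (phoneme_index, inflection)."""
    return "".join(encode_phoneme(p, i) for p, i in pairs)


# A made-up three-phoneme word; the indices carry no real meaning.
print(encode_word([(5, 1), (17, 2), (40, 0)]))
```

Whatever the real mapping is, two characters per phoneme explains the
low data rate the brochure claims: a word of five phonemes costs the
host only ten characters.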
TABLE 24


Interstate VTM150 Voice Response Module


INTERSTATE ELECTRONICS CORPORATION

VOICE RESPONSE MODULE
Model VTM150


Interstate's single-board Voice Response Module


• Single-board voice response system
• 500-word fixed vocabulary
• 1000-word user programmable vocabulary
• High-quality synthetic speech
• Multibus™ parallel interface
• Serial and parallel ASCII communication ports
• 2-watt output to external speaker
• Vocabulary generation, editing, and playback commands


Single-Board Voice Response

The VTM150 is a single printed-circuit board capable of
phonemic synthesis of isolated or connected speech with a
low data rate from a controlling host processor. The board
provides a standard fixed vocabulary of approximately 500
words and a user programmable vocabulary of approximately
1000 words.

The VTM150 is controlled via 12 commands: four commands
for various playback functions with and without editing, four
commands to control downloading with and without editing,
one command to allow uploading all or specific vocabulary
items, and three utility/system control commands.

Voice Response Module VTM150 contains a serial and a
parallel port for host communication via ASCII characters and
a Multibus™ parallel interface also controlled via ASCII
characters. Each of the 64 phonemes and 4 inflection levels
for each phoneme are sent from the host to the VTM150 as
two ASCII characters to generate the user defined programmable
vocabulary. Any vocabulary items in the fixed or programmable
memory may be randomly selected and output
to a listener via an ASCII word number.

The VTM150 delivers 2 watts of audio output into a 16-ohm
speaker.

User Configuration Control

The VTM150 board contains a microprocessor, 4K-bytes of
program EPROM, 4K-bytes of EPROM for fixed vocabulary,
10K-bytes of static RAM for programmable vocabulary and
word number index file, a parallel output port with speech
synthesizer integrated circuit and power amplifier, a host
serial and parallel port, and a Multibus interface.

User configuration control is provided via eight control lines
to the parallel I/O port shared by the speech synthesizer.
Two of these lines select the user's mode of communication
to the host; three select the serial word format; two select
the parallel handshaking options; and one selects the termination
character. Configuration control may be accomplished
by either external TTL logic levels or directly by
onboard switches.


VTM150  SPECIFICATIONS 
Performance 

Vocabulary Size: Approximately 500 words, fixed; 1000
words, user programmable.

Host  Commands 
Playback  Commands: 

PL  -  Playback  word  or  words,  including  repeat  and  delay 
features 

PA  -  Append  and  playback 
PI  -  Insert  and  playback 
PM  -  Modify  and  playback 
TABLE 25


Interstate VTM150 Voice Response Module Flow Chart


Edit/Programmable Vocabulary Commands:

A - Append (insert phoneme string)

I - Insert (insert phoneme string at specified word)

D - Delete (by word or words)

M - Modify (delete and insert new string and re-sequence)

Save and Utility Commands:

S - Save/upload

^S - Clear entire programmable vocabulary

F - Free (displays available RAM and last word number)

B - Bit set (for intercom control via parallel I/O)

^ - Control key

Digital Input/Output

• Parallel TTL input/output - 8 data input, 8 data output,
and 4 control.

• Asynchronous serial, RS232-C or current loop, 50 to
19,200 baud (switch selectable).

• Multibus parallel data transfer - 8 bidirectional data
lines, 8 additional interface lines, and 4 Multibus
communication lines.


Audio Output

2 watts into a 16-ohm speaker with onboard audio level
adjustment.

Mechanical

Card Size: 6.75 x 12.0 x 0.062 inches (standard Intel
Multibus card size).

Connector

Power: 86-pin, 0.156-inch spacing. Viking 2VH43/1AN or
equivalent.

Signals: 60-pin, 0.100-inch spacing. AMP PE5 14559
connector.

Electrical

Power Requirements: 590 mA at +5 Vdc; 100 mA at +12
Vdc; 110 mA at -12 Vdc.

Environmental

Temperature: 0 to 50°C.

Multibus™ is a trademark of the Intel Corporation.


Block Diagram of Voice Response Module VTM150
include speech synthesis. In particular, note their
Kurzweil Reading Machine Model III. Table 26 gives a short
description of the unit.

Speech  synthesis  as  an  aid  to  the  blind  has  been  one  of 
the  earliest  applications  in  mind  for  voice  or  speech 
generation  devices.  Kurzweil  has  been  one  of  the  leaders  in 
considering  the  needs  of  the  blind. 


Maryland  Computer  Services 

Maryland  Computer  Services  produces  speech  synthesis 
terminals,  designed  to  interface  to  an  existing  computer 
system.  Thus,  their  products  are  between  purely  board  level 
products  and  total  systems  level  products,  which  generally 
include  a  host  computer. 

Maryland Computer Services does not actually
manufacture speech synthesizers, but includes existing
devices in their products, which are basically oriented
toward the blind. One of their products is the Total Talk
computer terminal, which lists for $5,995. This unit uses a
Votrax VSB synthesizer board, which is a phoneme or rule
synthesizer with 64 phonemes. It has a list of
approximately 400 pronunciation rules.
SPEECH OUTPUT TERMINAL CAPABILITY ADDED TO
KURZWEIL  READING  MACHINE 


The Kurzweil Reading Machine Model III now incorporates a Speech Output
System that allows the Reading Machine to function as a full-word speech
output device when connected to a suitable computer or computer terminal. This
system permits the Reading Machine to be used as a receive-only terminal,
analogous to a computer terminal. Although the Speech Output System does not
entirely replace standard send-and-receive computer terminals, it can be used
in conjunction with most ordinary terminals to produce speech output in place
of print-outs.

The Speech Output System will accept ASCII Text presented through an RS-232
interface which is located at the rear of the Electronic Control Unit of the
Reading Machine. The ASCII Text is stored in a 2000 character buffer, converted
to phonemes and synthetically spoken. A complete set of keyboard instructions
allows the user to back up in memory, repeat previously spoken lines or words,
spell words and analyze punctuation.

The complete Speech Output System is contained in the digital cassette
which also contains the standard Reading Machine System. The Reading Machine
may be converted into a Speech Output System by means of a special command at
the keyboard. While the RS-232 port is set to operate at 4800 Baud, the
company will modify it on request to accept any Baud rate from 50 to 19200.

This computer voice output capability should open up new vocational
possibilities for the visually handicapped in such places as data processing
departments, financial institutions, reservations offices, and customer service
departments, in which the ability to read such computer information is a must. It
will also greatly facilitate research efforts of blind students, scientists, lawyers,
and other professionals who need access to computer information.


The terminal can switch from full words to spelled
speech, and includes adjustable speech rate, pitch, tone,
and volume controls. The terminal can be set to
automatically speak information going to or from the
terminal. The unit also includes a speaking cursor key.
Following is Table 27, which lists the Total Talk's
specifications.

Maryland  Computer  Services  also  manufactures  a  number 
of  other  systems  for  use  by  the  blind  which  incorporate 
speech  synthesis.  These  include  a  talking  telephone 
directory,  a  talking  information  management  system,  an 
automatic  form  writer,  a  talking  word  processing  system,  and 
a talking CRT terminal. These products illustrate the
increasing  use  of  speech  synthesis  in  commercial  products, 
in  an  application  where  it  is  of  special  benefit  to  blind 
users. 


Mimic

Mimic is a manufacturer of board level speech synthesis
products.

The Mimic speech processor is designed to synthesize
speech on smaller computer systems, such as the TRS-80,
Apple II, HP-85, etc. The unit consists of a board which
digitizes speech into a bit stream which can be sampled by a
computer, stored, and played back. The unit consists of an
analog-to-digital converter and a digital-to-analog
Specifications

Baud Rates - 110, 150, 200, 300, 600, 1200, 1800,
2400, 3600, 4800, 9600 and external.

Asynchronous Interface - EIA Standard RS-232C
(compatible with Bell 403A modems).

Transmission Mode - Full and Half duplex,
Asynchronous.

Operating Modes - On Line, Off Line, Character Line.

Parity - Selectable: Even, Odd, Zero, One.

Screen Capacity - 24 lines x 80 columns
(1,920 characters).

Display Memory - 48 lines x 80 columns
(3,840 characters).

8 Cursor Control Keys
Numeric Pad
Cursor Locator Key
ASCII Code Keyboard
Selectable Tabs and Margins
Full Editing Capabilities
Delivery - 90 Days


TOTAL TALK easily connects to most computer systems
either directly or over a telephone line. The
communication parameters are set from the keyboard
and handle a wide range of protocols. All parameters
can be vocalized, enabling the blind operator to change
and verify them.

TOTAL TALK's many features make its use uncomplicated
and straightforward. Tabs and margins are easily
set. The cursor locator key vocally informs the operator
of how many characters from the left margin and how
many lines down from the top of the CRT screen that
the cursor is positioned. Standard editing capabilities
include inserting lines and characters, deleting lines and
characters, clearing the entire display and underlining.
A numeric data entry pad is embedded in the standard
keyboard for easy entering of numbers.

For more information contact

MARYLAND COMPUTER SERVICES
converter (or a similarly-operating codec), with appropriate
downsampling.

The Mimic speech processor is described as having a
data rate of 9600 bits/second, which is a relatively high
bit rate. Mimic notes that a 400-word vocabulary can be
stored on one side of an 8-inch floppy disk (with an average
word duration of 0.5 seconds).
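
Mimic's storage claim can be checked with simple arithmetic. The
short calculation below is only a sanity check using the figures
quoted above; the function name is an assumption, and the disk
capacity used for comparison (256,256 bytes, the single-sided,
single-density 8-inch format of 77 tracks x 26 sectors x 128 bytes)
is an outside figure, not one stated by Mimic.

```python
BIT_RATE_BPS = 9600        # data rate quoted for the Mimic processor
AVG_WORD_SECONDS = 0.5     # average word duration quoted by Mimic
VOCABULARY_WORDS = 400     # vocabulary size quoted by Mimic

# Assumed capacity of one side of a single-density 8-inch floppy.
FLOPPY_SIDE_BYTES = 77 * 26 * 128


def vocabulary_storage_bytes(words, seconds_per_word, bits_per_second):
    """Bytes needed to store digitized speech for a whole vocabulary."""
    total_bits = words * seconds_per_word * bits_per_second
    return int(total_bits // 8)


needed = vocabulary_storage_bytes(VOCABULARY_WORDS,
                                  AVG_WORD_SECONDS,
                                  BIT_RATE_BPS)
print(needed, FLOPPY_SIDE_BYTES, needed <= FLOPPY_SIDE_BYTES)
```

Under these assumptions the 400-word vocabulary needs 240,000 bytes,
which just fits on one side of such a disk.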

The system comes whole, or in parts. A fully assembled
and tested module costs $79 and a bare board for the module
costs only $19.95. Following is Table 28, which lists available
Mimic units.

Mimic's speech processor unit is most interesting, and
the $19.95 board kit has to be considered an unqualified
bargain. The unit appears to have wide applications for
experimenters interested in digital sampling and playback of
voice. The unit could also apparently be used in
conjunction with a host computer for speech recognition,
given appropriate processing algorithms.

MSC

MSC's voice output products are board level products
which are generally designed for purely commercial
applications.

MSC manufactures voice output products which use LSI
circuitry for actual recording of input speech data for
subsequent playback. Thus, their devices are not truly
TABLE 28


Mimic's List of Products


1. User's Manual for the MIMIC Speech Processor: $5. Contains
complete theory of operation, schematics, assembly drawings, and
S-100 Bus interface example with Z-80 (8080) driver program.

2. MIMIC audio demonstration cassette tape: $7.50. Compares MIMIC
with other techniques in side-by-side listening tests.

3. Bare one-sided printed circuit board for the MIMIC Speech
Processor: $19.95. Build it yourself. Manual not included.

4. MIMIC Speech Processor: $79 ($75 without manual). A fully
assembled and tested module.

5. MIMIC System for Radio Shack's TRS-80: $169. Within minutes,
you'll be demonstrating speech I/O on your computer. Table or
wall mount. System includes manual, microphone, speaker with
volume control, power supply, and a special cable assembly.
Plugs into printer port on expansion interface, or use Radio
Shack's Printer Interface Cable #26-1411 to connect to bus.

6. MIMIC System for Cromemco's TU-ART: $169. Similar to item
#5 above, but with a different cable assembly.

7. MIMIC System for Parallel Port: $149. Can be wired directly
to TTL port on most computers. Similar to item #5 above, but
uses a standard DIP jumper instead of a special cable assembly.

** Available soon: MIMIC Systems for ZX-80, Apple, H-8, and HP-85.
Let us know your interests, and we'll put you on our mail list.

** Note: For all MIMIC Systems, deduct $4 from list price if a
manual is not required, and $10 if power supply not required.

8. S-100 Bus wire-wrap MIMIC interface card: $79. As described
in manual, fully assembled. Large area for additional user
logic. MIMIC System for Parallel Port, without power supply,
plugs directly into this card (order items #7 & 8 for $218 total).

9. STD Bus wire-wrap MIMIC interface card: $79. Similar to #8.
speech synthesizers in the strict definition of the term.
Yet, as their devices perform nearly identical functions as
do comparable rule or analysis synthesizers, they have been
included in this section dealing with voice output devices.
MSC notes that there are a number of advantages to their
approach as compared to other speech synthesis techniques.
First, MSC claims excellent voice quality for their devices,
which is "indistinguishable from live voice". Certainly, this
cannot be said of most commercial speech synthesizers. MSC
also notes that their approach uses no moving parts, unlike
analog tape transports. Similarly, MSC's LSI circuitry
avoids audio degradation associated with repeated playback
of audio tapes.

MSC's top-of-the-line device is the 1650 Programmable
Voice Readout System (VRS). The modular design of the 1650
VRS can accommodate 10 plug-in circuit boards, each with a
capacity of 16 words stored in fragments of 406 milliseconds
on individual ROMs and PROMs. Thus the vocabulary can be
expanded to 160 words of the user's choice. MSC will custom
build 1650 systems to include the words specified by the
customer. Table 29 describes the 1650 VRS's specifications.
Note that the 1650 comes complete with ROMs that are
preprogrammed with standard MSC words, plus programmable
PROMs ready to accept words of the user's choice. The 1650
has a list price of $650 and vocabulary is $50 per digit.
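
The capacity figures above follow from simple arithmetic, sketched here
in Python. The board and word counts come from the text; the mapping of
a word index to a board and slot is a hypothetical illustration, since
the source states only that the system is binary addressable and does
not give the address layout.

```python
BOARDS = 10           # plug-in circuit boards in a full 1650 VRS
WORDS_PER_BOARD = 16  # words stored on each board


def vocabulary_capacity(boards=BOARDS, words_per_board=WORDS_PER_BOARD):
    """Total number of words the system can hold."""
    return boards * words_per_board


def address_bits(word_count):
    """Smallest number of binary address lines that can select every word."""
    return (word_count - 1).bit_length()


def locate(word_index, words_per_board=WORDS_PER_BOARD):
    """Hypothetical mapping of a word index to (board, slot on that board)."""
    return divmod(word_index, words_per_board)


capacity = vocabulary_capacity()
print(capacity)                # 160
print(address_bits(capacity))  # 8
print(locate(159))             # (9, 15)
```

Under these assumptions a full 160-word vocabulary needs an 8-bit
binary address, while a single 16-word board needs only 4 bits.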
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1550 

Programmable Voice Readout System (VRS)


•  Solid-state reliability

•  Wide variety of applications

•  High-fidelity voice duplication

•  Expandable to 160 spoken words of your choice

The Model 1650 lets you add custom words to its
standard vocabulary without paying custom charges.
The system comes complete with Read Only
Memories (ROMs) that have been pre-programmed with
standard MSC words, plus Programmable Read Only
Memories (PROMs) ready to accept words of your
choosing.

Its modular design accommodates 10 plug-in
circuit boards. Each board has a capacity of 16 words
which are stored in fragments of 406 milliseconds on
individual ROMs and PROMs. A vocabulary can be
expanded to 160 words within the standard 1/2 ATR rack.
The desired message, which can be accessed instantly
whenever needed, is 'spoken' with such quality and
clarity it's indistinguishable from a live announcement.
This proven, binary addressable system is currently
being used throughout the world in critical applications
such as aircraft warning systems, hospitals, refineries,
chemical plants, and telecommunication and
information systems of every kind.

The Model 1650, like all our solid-state readout
systems, has no tapes to replace and no moving parts
which could jam or wear. It operates virtually
maintenance-free.

Specifications

Physical Size: 1/2 standard ATR rack (10.57" W x 5.2" H
x 8.58" D)

Output: Audio -6 dBm to 0 dBm balanced
Power Supply: +12 VDC
Input Power: 25 watts (max)

Operating Temperature: 0°C to 70°C
Input Format: Binary address
Output Transformer: isolated 600 ohms; dual audio
circuit output provides 8 ohms @ 250 mW for
monitoring

NOTE: Many standard interfaces are available. Please
contact the factory for detailed information.

Ordering Information

All orders for Model 1650 systems are custom built to
include the words specified by the customer. MSC uses
a Specification Sheet ordering system to control
individual customer requirements and assigns a
specific part number to each customer. Contact the
factory for details on ordering information.


Timing Diagram for Binary Input

MSC offers the 1700 Voice Readout System (VRS). This
unit features a similar approach to that of other MSC voice
output products.

The 1700 VRS is designed for use where output of
medium-length spoken messages is required. The one-board
unit contains the circuitry necessary to produce 16 words; a
second circuit board may be added to expand the vocabulary
to 32 words. Table 30 lists the 1700 VRS's specifications.
The 1700 VRS has a list price of $650 (with 50 digits).

For situations requiring vocabulary changes, MSC
recommends their 1750 VRS system, which stores individual
words on programmable ROMs. Thus, any vocabulary can be
specified without incurring setup or masking charges. Pause
durations of the 1750 can be varied from 0-150 msec. The
1750 has a list price of $900 (with 10 words). Table 31
provides a description of the VRS 1750 specifications.

MSC also markets an automatic number announcer and an
audio playback unit number announcer for use by telephone
companies. For repeated broadcast of fixed messages, MSC
markets the DCA-1 Dual Channel Annunciator described in
Table 32. The unit has two channels for simultaneous output
of a single message stored on programmable PROMs. The use of
PROMs precludes charges for setup or masking. Standard
message lengths are available up to five seconds, and the
memory storage section can be expanded for longer durations.

Timing Diagram

•  Solid-state reliability

•  Wide variety of applications

•  High-fidelity voice readout

•  Up to 16 standard words on one circuit board

•  Expandable to 32 words

•  Packaged assembly available

The Model 1700 has become a standard in the
telecommunications industry, and it's finding new
applications every day.

The unit is ideal for practically any situation where a
medium-length spoken message is required, such as
paging systems, computer and alarm systems, elevator
floor announcements, credit card verifications,
malfunction alerts and hotel/motel wake-up calls.
The one-board unit contains all the circuitry
necessary to produce 16 spoken words of extraordinary
fidelity, and a second circuit board may be added to
expand its natural sounding vocabulary to 32 words.
Since all words are stored in separate Read Only
Memories (ROMs), they can easily be added up in any
sequence desired. The 16 standard spoken words are
"zero" through "nine," "plus," "minus," "times," "divide,"
"equal" and "point." Additional words of your choosing
are subject to a one-time setup charge. The first 10
numeric words accept either binary address or 10
mutually exclusive switch closures. Additional words
must utilize binary address.

MSC's Model 1700 VRS is available as a circuit
board only, or as an enclosed assembly.

Specifications

Physical Size: 8" W x [illegible] H x 5 1/2" D

Output: Audio -6 dBm to 0 dBm

Power Supply: ±12 VDC and +5 VDC or #6 VDC

Input Power: 2.5 watts (max) for 10 words

Operating Temperature: 0°C to 70°C

Output Transformer: 600 ohms balanced and 8 ohms, 250 mW

Standard Interface: 34-pin 3M ribbon, P/N 3414-0000

Ordering Information

1700E-03: Plus, Minus, Point

The standard Model 1700's shown on the chart below
include the 10 numeric words zero to nine. The number
of additional words specified by the customer beyond
the first 10 numeric words must be added to the model
number as a dash number. Up to 22 additional words
may be specified. If your system does not require the
first 10 numeric words, consult factory for special model
number. Any additional requirements not covered by
these models may be ordered by consulting the factory
for details.

TABLE 31


MSC's 1750 Voice Response System


1750

Voice Response System


Timing Diagram


•  Up to 32 custom words without custom charges

Like the Model 1700, this system features a 16-word
spoken vocabulary expandable to 32 (using two circuit
boards).

And it's similar to the Model 1700 in more ways than
one. For example, it operates virtually maintenance-free
because it has no tapes or moving parts of any kind.

And its reliable, solid-state circuitry provides excellent,
natural sounding voice reproduction that can scarcely
be distinguished from the original.

The main difference is, the Model 1750 stores
individual words on Programmable Read Only
Memories (PROMs), instead of ROMs. That way, you can
specify any vocabulary you like without incurring setup
or masking charges.

So, for applications requiring vocabulary changes,
the Model 1750 is a wise choice.

It accepts binary address only, and features a
Pause Override Control that lets you adjust the duration
of the pause from 0 to 150 microseconds.

Specifications

Physical Size: 8" W x [illegible] H x 5 1/2" D

Output: Audio -6 dBm to 0 dBm

Power Supply: ±12 VDC and +5 VDC or #6 VDC

Input Power: 2.5 watts (max) for 10 words

Operating Temperature: 0°C to 70°C

Output Transformer: 600 ohms balanced and 8 ohms, 250 mW

Standard Interface: 34-pin 3M ribbon, P/N 3414-0000

TABLE 32


MSC DCA-1 Dual Channel Annunciator


DCA-1 

Dual  Channel  Annunciator 


•  Solid-state reliability

•  Completely automatic

•  Simultaneous output

•  Easy installation

•  Low maintenance

For repeated broadcasting of fixed messages, MSC's
reliable DCA-1 is hard to beat.

It features life-like voice reproduction with natural
attenuation and spacing. Messages flow smoothly and
are easily understood.

The DCA-1 provides two independent channels for
simultaneous output of a single message stored on-board
in Programmable Read Only Memories (PROMs).
The use of PROMs precludes any charge for setup
or masking.

Standard message lengths are available up to 5
seconds, although the memory storage section may be
expanded to accommodate longer durations.

There are no recording tapes to stretch or break
and no moving parts to wear. Maintenance of this
all-solid state system is almost zero.

Specifications

Physical Size: 10.5" W x 9.4" D x 1.5" H
Output: Audio -6 dBm to 0 dBm
Power Supply: -48 VDC (+5 VDC regulated, 1 amp;
±12 VDC regulated, 100 mA)

Input Power: 30 watts
Input Format: Switch closures
Operating Temperature: 0°C to 70°C

Ordering  Information 

Consult  factory 

Output Connections

National Semiconductor


National Semiconductor markets speech synthesis chips
which are for O.E.M. use. This involves incorporating speech
synthesis chips into existing circuit boards. Rather than
describing National's extensive line of LSI chip products
related to voice output, this report focuses upon their
basic synthesizer chips only. National's LSI chips are sold
without any control software.

National Semiconductor notes that their Digitalker
chips will synthesize voices for males, females, or
children. They market two basic synthesis chips.

First, there is National's DT1050 Digitalker kit.
This chip is intended, generally, for O.E.M. applications
(calculators, etc.). This kit features a chip with 137
words, two tones, and five pause durations. This kit sells
for approximately $90. It can be used in conjunction with
various computers, where the user can supply appropriate
control software. Table 33 gives a block diagram of the
DT1050 kit.

National also markets the DT1000 Digitalker. This
unit features a 133-word vocabulary, five silence durations,
and a 1/2-watt on-board amplifier.

Note that SCRL did not receive any reply to letters
sent to National Semiconductor inquiring about their speech
products. Descriptions of National's speech synthesizers
came from computer magazines, where their chips are commonly

National Semiconductor

DT1050 DIGITALKER™ Standard Vocabulary Kit


General Description

The DIGITALKER™ is a speech synthesis system consisting
of several N-channel MOS integrated circuits. It
contains a speech processor chip (SPC) and speech ROM
and, when used with external filter, amplifier, and speaker,
produces a system which generates high quality speech
including the natural inflection and emphasis of the
original speech. Male, female, and children's voices can
be synthesized.

The SPC communicates with the speech ROM, which contains
the compressed speech data as well as the frequency
and amplitude data required for speech output. Up to
128k bits of speech data can be directly accessed.

With the addition of an external resistor, on-chip debounce
is provided for use with a switch interface.

An interrupt is generated at the end of each speech sequence
so that several sequences or words can be
cascaded to form different speech expressions.

The DT1050 is a standard DIGITALKER kit encoded with
137 separate and useful words, 2 tones, and 5 different
silence durations. (See the Master Word List Table I.) The
words and tones have been assigned discrete addresses,
making it possible to output single words or words concatenated
into phrases or even sentences.

The "voice" output of the DT1050 is a highly intelligible
male voice. The vocabulary is chosen so that it is applicable
to many products and markets.

Features

•  COPS™ and MICROBUS™ compatible
•  Designed to be easily interfaced to other popular
microprocessors
•  144 addressable expressions, including numbers
•  Natural inflection and emphasis of original speech
•  Addresses 128k of ROM directly
•  TTL compatible
•  On-chip switch debounce for interfacing to manual
switches independent of a microprocessor
•  Interrupt capability for cascading words or phrases
•  Crystal controlled or externally driven oscillator


Applications

•  Telecommunications
•  Appliance
•  Automotive
•  Teaching aids
•  Consumer products
•  Clocks
•  Language translation
•  Annunciators


marketed by second parties. Their chips allow the user to
concatenate phonemes. Although National's synthesizer chips
output speech which is immediately distinguishable from live
speech, they do so at a relatively low price.
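
The DT1050 data sheet reproduced in this section states that each word
has a discrete address and that an interrupt marks the end of each
speech sequence, so that words can be cascaded into phrases. The Python
sketch below illustrates that control flow only. The word addresses are
hypothetical (the Master Word List is not reproduced in this report),
and FakeSynthesizer is a stand-in written for this example rather than
any National Semiconductor interface.

```python
# Hypothetical word-to-address table; real addresses would come from
# the manufacturer's Master Word List, which is not reproduced here.
WORD_ADDRESS = {"the": 0x10, "time": 0x11, "is": 0x12}


class FakeSynthesizer:
    """Stand-in for the hardware: records each address it is told to speak."""

    def __init__(self):
        self.log = []
        self.busy = False

    def start(self, address):
        # On real hardware this would latch the address and begin speaking.
        self.log.append(address)
        self.busy = True

    def wait_for_end(self):
        # On real hardware this would block until the end-of-sequence
        # interrupt arrives; here it simply clears the busy flag.
        self.busy = False


def speak_phrase(synth, words, table=WORD_ADDRESS):
    """Speak words one after another, waiting for each one to finish."""
    for word in words:
        synth.start(table[word])
        synth.wait_for_end()


synth = FakeSynthesizer()
speak_phrase(synth, ["the", "time", "is"])
print([hex(address) for address in synth.log])
```

Waiting for the end-of-sequence signal before starting the next address
is what keeps one word from cutting off the previous one.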


Percom Data Co.

Percom's speech synthesis products are in the category
of what might roughly be characterized as board level
products. Their peripheral devices are designed to plug
into smaller computer systems.

Percom markets a variety of peripheral devices for
smaller computer systems, such as the TRS-80. One of these
devices is a module designed to let users control LPC
synthesized output from the Texas Instruments' (TI's) Speak
& Spell unit. The unit uses a 9-volt battery or a standard
calculator power pack. The unit requires Level II BASIC,
4K of memory, and an expansion interface or printer cable
adaptor. Table 34 follows, describing the Speak-2-Me-2
interface module, which retails for $69.95.

Telesensory Speech Systems

Telesensory's speech synthesis products are in the
category of board level products.

Telesensory markets a number of synthesizers which use
stored LPC coefficients for actual synthesis. The
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Percom  Speak-2-Me-2  Interface  Module 


SPEAK-2-ME-2:
The Gift of Speech

This clever interface module makes a Texas Instruments'
Speak & Spell the voice of your computer. Install it, hook
up your computer and add the dimension of speech to
business, education and game programs.

Speech is controlled at the keyboard, or by your own Level II
BASIC programs which output whole sentences with a few
program lines.

The SPEAK-2-ME-2 module installs in the battery compartment
of a Speak & Spell. Some modification of the Speak & Spell
is necessary. Power is provided from an ordinary calculator
power pak or a nine-volt battery.

SPEAK-2-ME-2 includes an interconnecting cable for the
TRS-80 computer and a comprehensive users manual. The
users manual includes Level II BASIC listings of the
primary driver program and application examples.

System Requirements

Level II BASIC, 4 Kbytes of memory and either an Expansion
Interface or Printer Cable Adapter are required. The Speak
& Spell device and power pak must be provided by the user.

Advanced Speech Driver & Games Diskette

This diskette contains eight speech enhanced games and a
driver program which permits your Level II BASIC calling
program to:

1. Speak any word or phrase from the internal word list of
Speak & Spell.

2. Speak parts of words and phrases.

3. Speak a word or phrase at one half normal speed.

The diskette also includes the primary driver program
listed in the SPEAK-2-ME-2 Users Manual.
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synthesizers  which  will  be  reviewed  in  this  report  are  the 
Speech 1000 LPC board, the Series III Speech Synthesizer
module, a prototype text-to-speech system, and the S2B and
S2C synthetic speech boards.

The Telesensory 1000 LPC board is noted to have
superior voice quality and the capacity for large vocabulary
storage (up to 458K, typically 200-300 seconds). The unit
also features a variety of common interface options,
including the popular RS-232C. The unit features a number
of variable parameters for synthesizing speech, such as a
variable and programmable audio gain and output, speech
speed control, and interword pause control.
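
The storage figures quoted above can be related to the encoding rate
with a short calculation. This Python sketch assumes that "458K" means
458,000 bits and uses the 2200 bits per second standard encoding rate
listed in Table 35; both readings are assumptions made for
illustration.

```python
MEMORY_BITS = 458_000   # assumed reading of "458K" as bits
STANDARD_RATE_BPS = 2200  # standard LPC encoding rate from Table 35


def speech_seconds(memory_bits, bits_per_second):
    """Seconds of stored speech that fit in the given memory."""
    return memory_bits / bits_per_second


def rate_for_duration(memory_bits, seconds):
    """Average encoding rate needed to fit the given duration."""
    return memory_bits / seconds


seconds_at_standard_rate = speech_seconds(MEMORY_BITS, STANDARD_RATE_BPS)
rate_for_300_seconds = rate_for_duration(MEMORY_BITS, 300)

print(round(seconds_at_standard_rate))  # 208
print(round(rate_for_300_seconds))      # 1527
```

Under these assumptions the standard rate gives roughly 200 seconds of
speech, and reaching 300 seconds requires an average rate near 1500
bits per second, which is consistent with the "200-300 seconds" range.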

Telesensory Speech Systems notes that the Speech 1000
board is applicable for all languages. For natural
intonation, Telesensory suggests building sentences around
phrases or other sentences. Following is Table 35 with the
Telesensory's Speech 1000 board, including a block diagram
of the system's configuration. The Speech 1000 board has a
retail price of $1,200 for single units. It is
Telesensory's top-of-the-line synthesizer.

Telesensory Speech Systems produces the Series III
Speech Synthesizer Module. This unit is lower-priced than
the Speech 1000 board, costing $295 to $395, depending upon
options selected. The Series III Synthesizer can
accommodate both custom and standard vocabularies in
standard ROMs or EPROMs up to a capacity of 256 utterances,


TABLE 35


Telesensory Speech Systems 1000 Board


System Specifications


Synthesizer

Telesensory's PDSP (Programmable Digital Signal
Processor) chip set implementing a 12 Pole Lattice
Filter Structure

Speech Encoding

Linear Predictive Coding: 2200 bits per second of
speech is standard, other encoding rates available

Vocabulary Capacity

Approximately 200 seconds of speech at 2200 bps
encoding rate, up to 300 seconds at lower rates.
The available time may be used to store any number
of words, phrases or sentences

Vocabulary Memory

Total of 7 28-pin sockets
Available for ROM, EPROM or RAM
Total capacity of 458k bits of standard
semiconductor memory

Interfaces

Multibus: I/O Slave
Serial Port: (RS232C) 300-9600 Baud (Jumper Select)
Parallel Port: (TTL) 8 bits (Data), 5 bits (Control)

Audio

2 Watts into 8 ohms
Low Pass Filter: fc = 4.8 kHz @ -6 dB
Rolloff: 12 dB/octave
Programmable Amplitude Level: 8 levels, 3 dB/level
Programmable Speech Speed: 2X normal to [illegible] normal

Power

+5V at 2 amps (max.)
+12V at [illegible] amps (max.)
-12V at 0.1 amps (max.)

Size

Intel's Multibus Board Form Factor
6.75" x 12.00" x 0.50" (17.15 cm x 30.48 cm x 1.27 cm)

Weight

16 oz (454 gm)

Operating Temperature

0°C to 55°C


SPEECH  1000™  SYSTEM  BLOCK  DIAGRAM 
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for  a  total  time  of  approximately  100  seconds  of  synthesized 
speech. The unit is a complete voice response system,
including an on-board audio amplifier. The unit interfaces
directly with most popular buses, including TTL compatible
I/O port, or simple logic controllers. The unit features a
distinctive male voice, and has a relatively large
vocabulary capacity. The unit is powered by a single +5V
power supply. Following are Tables 36 and 37 covering the
Series III Speech Synthesizer Module.

Telesensory Speech Systems notes that they are
developing a prototype text-to-speech system, which will be
a stand-alone unlimited speech peripheral device. The unit
will feature an RS-232C interface. The text-to-speech
system will include some prosodic features for sentences.
Basically, the unit is described as having two modes: 1)
lexical - for normal stress patterns, and 2) prosodic -
where whole phrases are analyzed, and words are stressed in
relation to surrounding words.

Telesensory markets two mini circuit boards for speech
synthesis. The S2B and S2C boards feature the minimum
components necessary for speech. The units include one or
two 16K ROMs, depending upon vocabulary selection, plus
clock frequency circuitry. Available vocabularies include a
24-word calculator vocabulary, and two 64-word general
purpose vocabularies. The units are based upon the CRC
synthesizer chip, which costs $65 with vocabularies running
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an  additional  $30  to  $60.  Following  is  Table  38  of  the 
Telesensory  mini  circuit  synthesizer  boards. 

Telesensory offers a wide variety of LPC synthesizer
board-level products. Telesensory particularly emphasizes
the wide variety of custom vocabularies that they have
available, including numerous foreign languages.

Finally, Telesensory notes that they have several new
products which are either available or coming out soon.
First, there is a real-time text-to-speech rule synthesizer.
This unit converts ASCII characters to speech via a
cascade/parallel synthesizer, and should cost around $3,500.
This unit should be most interesting to evaluate, for it
would appear to be the first commercially available product
to incorporate a cascade/parallel synthesizer. Second,
Telesensory has just brought out the Speech 1020 unit, which
is a Speech 1000 board in a self-contained unit with
internal power supply. The unit is called the R5C1020, and
it sells for approximately $2,500, with vocabulary at
additional cost. Telesensory especially emphasizes their
custom vocabulary capabilities and their ability to serve
customers with relatively low-volume needs.

As mentioned earlier, Telesensory Speech Systems has a
telephone line for demonstrating their synthesizers: call
(415) 968-6257. Their telephone demonstration includes
their new text-to-speech unit.

Telesensory Mini Circuit Synthesizer Boards


Speech Synthesizer Module


DESCRIPTION OF OPERATION


Originally developed for use in TSI's talking calculator for the blind,
we are now making our unique speech synthesizer circuit boards available
for small computer and OEM applications. Pre-programmed vocabulary
data is stored in either one or two 16K MOS ROMs (depending
on the number of words in the vocabulary). When provided with a 6-bit
parallel binary address code and a START signal, the custom LSI ROM
controller (CRC) fetches appropriate control data from the ROM, determines
the speech characteristic of the word, and converts the digital
information to an analog audio signal via an on-chip D/A converter. The
analog then requires filtering and amplification. The result is a clear,
highly intelligible male voice. The operation of the board is described
in the block diagram.

[Block diagram. Labels: 6-bit parallel address, start signal, busy
signal, analog voice out, -15V CRC power, -5V ROM power,
CRC microcontroller, ROM (speech synthesis control data)]

A VARIETY OF VOCABULARY CHOICES


Mini Circuit Boards

Mini Circuit Boards are small PC boards measuring
less than 3.10" square which provide the minimum
necessary components for speech synthesis: the CRC
micro-controller, one or two 16K ROMs (depending
on the vocabulary selected), and clock frequency
circuitry. Vocabularies available include the 24-word
calculator vocabularies described under Calculator Speech
Synthesis Module as well as two 64-word general-purpose
vocabularies.

•  Power: -5V and -15V

•  In addition to power, an audio filter circuit (described in the
Engineering Note which accompanies the board), an audio
amplifier, and a speaker must be provided by the user.

•  Interface: Double-sided edge connector, ten pins each side.

•  Can be made TTL compatible.

S2A: 24-word Calculator Vocabulary. This model also available in
French (S2[illegible]) and German (S2D).


Calculator Speech Synthesis Module
Features and Specifications


•  Calculator Vocabulary (legible entries): one, two, four,
seven, eight, times, minus, over, point, clear, equals, root,
overflow

Custom Vocabularies

A custom vocabulary can be programmed to fit your particular
applications.


Limited Warranty

The Speech Synthesis Module is warranted against defects in
material and/or workmanship for a period of 90 days from the
date of delivery. Upon specific written request, a copy of the
complete product warranty may be obtained free of charge
from Telesensory Systems, Inc., at the address stated below.


TELESENSORY
Speech Systems

3408 Hillview Avenue • PO Box 10099
Palo Alto, California 94304
(415) 493-2626 • Telex 348332

Texas Instruments


Texas Instruments (TI) markets speech synthesis
products which are in the board-level category. The company
makes a variety of LPC synthesizers.

Since  the  synthesizers  rely  upon  an  analysis  synthesis 
approach,  actual  speech  signals  are  analyzed  and  only  the 
main  spectral  characteristics  are  reproduced.  TI  notes  that 
there  are  two  main  types  of  analysis  synthesis  synthesizers: 
1) formant  and  2) LPC.  The  first  synthesizers  produced  were 
the  basic  formant  synthesizers,  followed  by  the  currently 
popular  LPC  synthesizers.  For  both  types  of  synthesizers, 
the  use  of  downsampling  reduces  the  bit  rate  from  the 
original  speech,  on  the  order  of  100  to  1.  This  is  essential 
where  memory  space  is  limited.  However,  very  low  data  rates 
can  lead  to  relatively  low  quality  voice  output,  so  it  is 
important  to  reproduce  the  essential  acoustic 
characteristics  of  human  speech.  One  advantage  of  LPC 
synthesizers  is  that  they  reduce  coarticulation  problems 
associated  with  rule  synthesizers,  since  they  model  output 
based  upon  real  human  speech. 
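
The effect of a bit-rate reduction on the order of 100 to 1 on memory
requirements can be illustrated with a short Python calculation. The
sampling rate and sample size used for the original speech (8,000
samples per second at 12 bits each) are assumptions chosen for this
example; the report gives only the approximate ratio.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate of uncompressed digitized speech."""
    return sample_rate_hz * bits_per_sample


def reduced_bit_rate(original_bps, ratio=100):
    """Bit rate after a reduction of roughly ratio to 1."""
    return original_bps / ratio


def storage_bits(seconds, bits_per_second):
    """Memory needed to hold the given duration of speech."""
    return seconds * bits_per_second


# Assumed digitization parameters for the original speech signal.
original_bps = pcm_bit_rate(8000, 12)
synthesis_bps = reduced_bit_rate(original_bps)

print(original_bps)                     # 96000
print(synthesis_bps)                    # 960.0
print(storage_bits(10, original_bps))   # 960000
print(storage_bits(10, synthesis_bps))  # 9600.0
```

With these assumed figures, ten seconds of speech shrinks from 960,000
bits to about 9,600 bits, which shows why the reduction matters where
memory space is limited.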

Texas Instruments has just introduced three new voice
synthesis processors: the TMS5100, the TMS5200, and the
TMS5220 chips. Quantity discounts are available for these
chips, which otherwise range from approximately $30 to $45.

TI has several voice synthesis memories available for
use with their voice synthesis processors. These are the
TMS6100 and TMS6125 chips. These LSI chips are also
relatively inexpensive, with quantity discounts available
for the O.E.M. market.

TI markets several evaluation kits for their voice
output products. First, they have their s PSB100 1-0 11
evaluation board. This is a small board intended for O.E.M.
users. The unit, which contains no microprocessor, is
capable of synthesizing eight phrases. It uses a 9-volt
power supply, and sells for approximately $99. Second,
costing approximately $1,000, is TI's RS232 speech
evaluation board. This board is designed to plug into RS232
interfaces and comes with a 25-word vocabulary (expandable
to approximately 1000 words with additional ROMs). This
board is available only from TI's Regional Technology
Centers. Finally, TI markets the S200 series evaluation
board for $499. The unit has less memory than the RS232
board, but still has variable intonation.

Texas Instruments also markets microcomputer board
products. These include the TM990/306 speech module, with a
standard 200-word industrial vocabulary (up to 400 words
when mask-programmed ROMs are used for storage). The unit
sells for $1,200. It is also available without the standard
vocabulary for applications using customer-specified words
(as the TM990/306-2). TI notes that this unit will be
replaced soon, and that they currently have new voice
synthesis products coming out at a rapid rate. A number of
these new products will offer additional capabilities, such
as allophone dictionaries, variable intonation, sound
effects, etc. A further trend in this area will be an
increase in performance to price ratio.

Texas Instruments also specializes in custom speech
boards, for very specific customer needs. TI stresses its
ability to quickly produce custom speech boards (often as
rapidly as 9 months). TI markets custom speech boards even
for low-volume applications.

TI also notes that they offer courses in speech
synthesis. One popular approach has been for customers to
purchase an evaluation board, attend TI's course in speech
synthesis ($150), and leave this course with a working
knowledge of how to get their evaluation board kit operable.

A recent addition to the TI product line has been the
talking Loran C Navigator. Loran C is, of course, the U. S.
Coast Guard's main navigational system (hyperbolic
navigation). The TI9900 and the TI9900N with speech option
announce Loran C navigation information in ships and boats.
Up to four items may be selected for announcement from the
following list:

1)  time 

2)  position 

3)  speed  over  the  bottom 

4)  range  to  waypoint 

5)  time  to  go 

6)  cross-track error

7)  course made good

8)  bearing to waypoint

The unit may be set to announce its four messages at
intervals ranging from 6 seconds to 1 hour. It announces
power-up status, system warnings, and entry corrections.
The unit sells for $695 plus installation. This price is
very competitive even though the synthesis system is very
"special purpose" oriented.


Votrax

Votrax  voice  synthesis  products  may  be  divided  into 
board  level  and  chip  level  products. 

Votrax currently markets at least four products based
upon their SC-01 CMOS Phoneme Speech Synthesizer. This is
essentially a rule synthesizer, which can phonetically
synthesize continuous speech, of unlimited vocabulary, from
low data rate inputs. This latter point is the main
advantage of rule synthesizers, in that synthesizers based
on stored speech typically require large storage areas,
which tends to limit the potential size of the output
vocabulary. The SC-01 unit consists of a single chip
containing 64 different phonemes which are accessed by a
6-bit code. The proper sequential combination of these
phoneme codes creates continuous speech. Note that the
SC-01 is a very cost-effective unit, priced at $55
(quantities of five or more). Table 39 documents the
characteristics of the SC-01 chip.

TABLE 39


Votrax SC-01 Synthesizer Chip


Votrax

A Division of Federal Screw Works
500 Stephenson Highway
Troy, Michigan 48084


SC-01 SPEECH SYNTHESIZER

DATA SHEET


Votrax® CMOS Phoneme Speech Synthesizer


GENERAL DESCRIPTION

The SC-01 Speech Synthesizer is a completely self-contained
solid state device. This single chip phonetically synthesizes
continuous speech, of unlimited vocabulary, from low data
rate inputs (Figure 1).

Speech is synthesized by combining phonemes (the building
blocks of speech) in the appropriate sequence. The SC-01
Speech Synthesizer contains 64 different phonemes which are
accessed by a 6-bit code. It is the proper sequential
combination of these phoneme codes that creates continuous
speech.

The SC-01 Speech Synthesizer is cost-effective, consumes
minimal power and enables in-house product development
without vendor dependency. Signals from the SC-01 are
applied to an audio output device to amplify and distribute
the synthesized speech. See Figure 2.


Figure 1. Votrax® SC-01 Speech Synthesizer


FEATURES


•  Single CMOS chip

•  70 bits per second

•  22 pin package

•  9 ma. current drain

•  Wide voltage supply range

•  Latched 5V compatible inputs

•  Digital pitch level inputs

•  Automatic inflection

•  On-chip master clock circuit

•  Optional external master clock

•  Variety of voice effects

•  Sound effects

•  Customer product security


Figure 2. SC-01 Flow Diagram

Votrax also markets the Speech PAC (Phoneme Access
Controller) which includes the SC-01 chip. This unit
includes provision for additional vocabulary by allowing for
storage of additional phoneme codes. Votrax notes that this
unit is especially suitable for inclusion with a variety of
equipment, controllers, games, etc. The unit contains an
EPROM socket which may be jumpered to accept a 32K EPROM
for stored vocabulary expansion. Phonemes and prestored
words can be mixed as desired. Following are Tables 40 and
41 documenting features of the Speech PAC unit, which sells
for  $27*S,  and  detailing  a  flow  diagram. 
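
To make the "low data rate" point concrete, the following Python sketch packs a sequence of 6-bit phoneme codes into bytes and recovers them again. It is only an illustration: the function names are hypothetical, the code values used in any call are arbitrary placeholders rather than actual SC-01 phoneme assignments, and the packing layout is an assumption, not something taken from the Votrax literature.

```python
def pack_phoneme_codes(codes):
    """Pack 6-bit phoneme codes into bytes, most significant bit first.

    Hypothetical storage layout for illustration only; the SC-01 data
    sheet states only that each phoneme is selected by a 6-bit code.
    """
    bit_buffer = 0
    bit_count = 0
    packed = bytearray()
    for code in codes:
        if not 0 <= code < 64:
            raise ValueError("a phoneme code must fit in 6 bits (0-63)")
        bit_buffer = (bit_buffer << 6) | code
        bit_count += 6
        while bit_count >= 8:
            bit_count -= 8
            packed.append((bit_buffer >> bit_count) & 0xFF)
            bit_buffer &= (1 << bit_count) - 1
    if bit_count:
        # Pad the final partial byte with zero bits.
        packed.append((bit_buffer << (8 - bit_count)) & 0xFF)
    return bytes(packed)


def unpack_phoneme_codes(data, n_codes):
    """Recover n_codes 6-bit codes from bytes made by pack_phoneme_codes."""
    total_bits = len(data) * 8
    if 6 * n_codes > total_bits:
        raise ValueError("not enough data for the requested number of codes")
    bits = int.from_bytes(data, "big")
    return [(bits >> (total_bits - 6 * (i + 1))) & 0x3F for i in range(n_codes)]


def phonemes_per_second(bits_per_second=70, bits_per_phoneme=6):
    """Phoneme rate implied by the data sheet's 70 bits per second figure."""
    return bits_per_second / bits_per_phoneme
```

Under these assumptions, eight phoneme codes occupy six bytes, and a 70 bit-per-second stream corresponds to roughly a dozen phonemes per second.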

The top-end synthesizer in the Votrax line is the
Versatile Speech Module (VSM/1). The unit incorporates
additional features over those in the Votrax Speech PAC and
sells for $995.

This unit also utilizes the SC-01 synthesizer chip. It
has a large lexicon of commonly-used words (industrial
engineering based) stored in EPROM. It includes a built-in
prefix/suffix table for prestored words. Additional
vocabulary can be created and permanently stored on EPROMs
(8K to 16K). Other notable features of the unit include a
1,300+ word prestored vocabulary, sound effects, variable
stress (4 fixed levels, 4 transitional inflection levels), 8
speech rates, and 8 pause durations. Following are Tables 42
and 43 with a listing of the VSM/1 synthesizer's features,
applications, and specifications.


TABLE  40 


Votrax Speech PAC


SPEECH PAC™

Votrax

(Phoneme Access Controller)


FEATURES


•  Low cost complete system

•  Phoneme accessing capability for unlimited
system vocabulary

•  Additional vocabulary specific to user needs can be
created and permanently stored

•  Ultra low bit rate of SC-01 maximizes ROM word storage
capability

•  True synthetic speech technology eliminates the
constraints of a small, fixed vocabulary speech module

•  Parallel interface for computer, controller or
preselected diode matrix to access prestored
words or create phonetic speech

•  On board audio amplifier with volume control


Figure 1. Votrax Speech PAC™ (Phoneme Access Controller).
Labeled parts: edge card parallel interface, port connector.


APPLICATIONS

•  Low budget systems for personal, experimental
or low volume OEM product design

•  Fixed vocabulary for systems requiring limited
vocabulary

•  Add on speech output for existing controllers,
educational programs, talking games, etc.

•  Annunciators for alarm systems, elevators,
stations, etc.


DESCRIPTION

The Votrax Speech PAC™ introduces a new level
of speech synthesis performance and flexibility at
low cost. Based on the truly synthetic speech
technology of the SC-01, the Speech PAC™
provides the system designer with a small,
self-contained circuit board which is easily adapted for
use with a variety of equipment, controllers,
games, etc.

The Speech PAC™ is customer programmable and
expandable. The user can easily reconfigure the
Speech PAC™ vocabulary, as desired. The
EPROM socket may be jumpered to accept a 32K
EPROM for stored vocabulary expansion. Phonemes
and prestored words can be mixed, as
desired, to produce an output with unlimited
vocabulary.


TABLE 41


Votrax Speech PAC - Further Specifications

SPEECH  PAC™  -  PHONEME  ACCESS  CONTROLLER 


OPERATING  CAPABILITY 

Prestored words are accessed in 8 byte increments. The low
baud rate of the SC-01 Speech Synthesizer allows a single
2716 EPROM to store up to 255 words, and a single 2532
EPROM to store up to 511 words. Long phoneme sequences
(more than 8 phonemes) may cross entry boundaries. The
Speech PAC™ signals the external controller at the end of
each phoneme sequence.
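
The word counts quoted above follow from simple arithmetic on the EPROM sizes, which the short Python sketch below works through. Several points are assumptions made for illustration rather than statements from the data sheet: that each phoneme occupies one byte of an 8-byte entry, that the one-entry difference between the raw slot count and the quoted word count reflects a single reserved entry, and that no end-of-sequence marker needs to be stored. All names are hypothetical.

```python
import math

ENTRY_BYTES = 8  # prestored words are accessed in 8-byte increments

# Capacities of the two EPROM types named in the data sheet.
EPROM_BYTES = {
    "2716": 2 * 1024,  # 16K bits = 2K bytes
    "2532": 4 * 1024,  # 32K bits = 4K bytes
}


def entry_slots(eprom):
    """Number of 8-byte entries that fit in the given EPROM."""
    return EPROM_BYTES[eprom] // ENTRY_BYTES


def max_words(eprom, reserved_slots=1):
    """Word capacity, assuming one entry is reserved (an assumption)."""
    return entry_slots(eprom) - reserved_slots


def slots_needed(n_phonemes):
    """Entries consumed by one word, assuming one phoneme per byte."""
    return math.ceil(n_phonemes / ENTRY_BYTES)
```

With these assumptions a 2716 holds 256 entries and a 2532 holds 512, which matches the quoted limits of 255 and 511 words once one entry is set aside; a word longer than 8 phonemes simply spills into the next entry.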


Figure  2.  Prestored  Word  Mode 


Figure  3  Phoneme  Mode 


SPECIFICATIONS 

•  SC-01  Phoneme  Synthesizer 

•  Up  to  255  word  storage  in  a  single  2716 
EPROM 

•  Expandable with the use of a 32K EPROM

•  Mixed  prestored  word/phrase  and  phoneme 
sequencing 

•  On  board  audio  amplifier 

•  Parallel  interface 

•  External  master  clock  option 

•  Handshaking  with  external  controller 

•  Unlimited  vocabulary 

•  User  custom  programmable 

•  Adaptable for use with limit switches and
minimal intelligent controllers


TABLE 42


FEATURES


•  True synthetic speech technology and a built in
microcomputer eliminate the constraints of a
small fixed vocabulary speech module

•  Ultra low bit rate of the SC-01 maximizes ROM
word storage capabilities

•  Large  lexicon  of  commonly  used  words  with 
industrial  engineering  base  stored  in  EPROM 

•  Built  in  prefix/suffix  table  for  prestored  words 

•  Additional  vocabulary  can  be  created  and 
permanently  stored 

•  Phoneme  accessing  capability  for  unlimited 
vocabulary 

•  Speech rate and pitch dynamic programming for
stress patterns and simulation of multi-voice
environments

•  Sound  effects,  from  gunfire  to  musical 
sequences  can  be  easily  created  from  prestored 
sound  macros.  Additional  sound  macros  can  be 
user  defined  and  EPROM  stored  for  even 
greater  flexibility. 

•  Expandable  via  interface  ports 

•  Parallel and RS232 compatible serial interfacing
with selectable baud rates and terminal modes

•  Foreground and background simultaneous
operation for speech and voxOS (voice operating
system)

•  Built  in  microcomputer  can  also  simultaneously 
perform  monitoring  activities  and  execute 
speech  commands 


Figure 1. Votrax VSM/1™
(Versatile Speech Module)


APPLICATIONS

•  The VSM/1™ can be used as a microcomputer
to simulate or develop talking products, such as
a talking calculator or talking games. It can also
be used for unlimited real time speech synthesis
while simultaneously executing commands and
performing monitoring activities.

•  The VSM/1™ can plug directly into the card
cage of an industrial control computer to provide
prompting for operating personnel (instructions
for a real time situation). Typical applications
are chemical processing plants, nuclear power
stations, aircraft systems, seismic monitoring
stations and automated warehousing.


TABLE 43


Votrax VSM/1 Synthesizer Further Specifications


VERSATILE  SPEECH  MODULE™-  VSM/1 


SPECIFICATIONS 

General 

•  1,300  +  prestored  vocabulary 

•  Prefix/suffix modifiers

•  Phoneme  mode 

•  Sound  effects 

•  Speech  stress 

•  Usable  as  a  general  purpose  controller/ 
simulator 

Hardware 

•  SC-01 phoneme synthesizer

•  Powerful 6800 MPU (microprocessor unit)
based design

•  Parallel and serial (RS232) interface (selectable
baud rate of 75 - 9600 bits per second)

•  1K byte RAM (sockets for additional 2K bytes)

•  2K  byte  voxOS  operating  system 

•  8K  byte  prestored  vocabulary  ROM 

•  Expansion  sockets  for  an  additional  8K  bytes 
(2716)  to  16K  bytes  (2532)  of  jumper  selectable 
EPROM’s 

•  On  board  audio  amplifier,  8  ohm,  1  watt,  with 
volume  control 

•  Half memory plane expansion connector (32K
locations out of 64K. Customer access to 32K
locations via the microcomputer data address
bus.)

•  Form  compatible  with  a  popular  microcomputer 
board 

•  Variable  speech  rate  clock 

•  Variable  master  clock  frequency  circuitry  for 
pitch  control 


voxOS 

•  Full feature byte oriented editor (insert, delete,
change and move data pointer)

•  Computer  and  terminal  prompting  modes 

•  Phonemes,  sound  effects,  controls  and 
prestored  speech  may  be  intermixed  in  any 
audio  sequence  memory 

•  4  audio  sequence  memories  +  1  sound  effects 
control  memory  (16  blocks  of  8  parameters  each) 

•  Memory  dump 

•  Execute  6800  operating  code  sequence  (for 
downloading  or  overriding  operating  system) 

•  12  prestored  sound  macros  (to  provide  basic 
waveshapes  for  user  selection  of  features) 

•  4  user  definable  sound  macros  (to  reside  in  user 
supplied  ROM  firmware) 

•  48 programmable MCRC (master clock resistor
capacitor) settings for continuous dynamic
manipulation of audio parameters (instantaneous
coarse controls)

•  4  MCRC  transitioned  trim  controls  (slowly  step 
toward  target) 

•  voxOS  bypass  (to  jump  into  user  supplied 
firmware) 

Audio  Sequence  Commands 

•  Prestored speech callout (16K byte direct access
range)

•  Two  phoneme  execution  modes  (fixed  inflection 
and  transitioned  inflection) 

•  4  fixed  inflection  levels  (instant) 

•  4  transitioned  inflection  levels  (step) 

•  16  sound  effect  (commands)  control  blocks 
(load  control  memory  and  pick  1  of  the  16) 

•  8  speech  rates  (will  not  affect  sound  effects) 

•  8  pause  durations 

•  8  prompting  sounds  (canned  sound  effects) 

•  Prestored prefix/suffix word modifiers

The final Votrax product that SCRL reviewed is the
Type-'N-Talk text-to-speech synthesizer. The unit costs
$375. To use this system, words are typed into a host
terminal and translated into synthesized speech by the
system's microprocessor-based text-to-speech algorithm. The
unit incorporates the SC-01 chip. It includes a 1-watt
amplifier, RS 232C interface, data echo of ASCII characters,
and phoneme access modes. Table 44 provides a basic
description of the Type-'N-Talk unit.

Note that Votrax has just announced a second
text-to-speech synthesizer, with reportedly better voice
quality than their Type-'N-Talk unit. This is the SVA
text-to-speech unit, which sells for approximately $1,650.
The unit is available with a 16K buffer (approximately 15
characters per second), which will hold up to 800
characters. This Votrax unit is also a rule synthesizer.

5.2  KLATT  SYNTHESIZER  PROGRAM 

This subsection deals with the Klatt synthesizer
program. This is not yet a marketed synthesizer as were the
ones discussed in the preceding section. The March, 1980,
Journal of the Acoustical Society of America contained an
article on Dr. Klatt's parametric (rule) synthesizer. This
synthesizer is the most advanced rule synthesizer to date,
and output from Klatt's synthesizer is virtually
indistinguishable from live speech, given the appropriate
parameter input.


TABLE 44


Votrax Type-'N-Talk Text-to-Speech Synthesizer


The exciting text-to-speech synthesizer
that has every computer talking.

Type-'N-Talk™, an important technological
advance from Votrax, enables your computer
to talk to you simply and clearly,
with an unlimited vocabulary. You can
enjoy the many features of Type-'N-Talk™,
the new text-to-speech synthesizer, for
just $375.00.

You operate Type-'N-Talk™ by simply typing
English text and a talk command.
Your typewritten words are automatically
translated into electronic speech by the
system's microprocessor based
text-to-speech algorithm.

Type-'N-Talk™ adds a whole new world of
speaking roles to your computer. You can
program verbal reminders to prompt you
through a complex routine and make your
computer announce events. In teaching,
the computer with Type-'N-Talk™ can
actually tell students when they're right
or wrong, even praise a correct answer.
And of course, Type-'N-Talk™ is great fun
for computer games. Your games come to
life with spoken threats of danger,
reminders, and praise. Now all computers
can speak. Make yours one of the first.

English text is automatically translated
into electronically synthesized speech
with Type-'N-Talk™. ASCII code from
your computer's keyboard is fed to
Type-'N-Talk™ through an RS 232C interface
to generate synthesized speech.
Just enter English text and hear the verbal
response (electronic speech) through your
audio loud speaker. For example: simply
type the ASCII characters representing
"h-e-l-l-o" to generate the spoken
word "hello."

TYPE-'N-TALK™

Type-'N-Talk™ has its own built-in
microprocessor and a 750 character buffer to hold
the words you've typed. Even the smallest
computer can execute programs and speak
simultaneously. Type-'N-Talk™ doesn't have
to use your host computer's memory, or tie it
up with time-consuming text translation.

Place Type-'N-Talk™ between a computer
or modem and a terminal. Type-'N-Talk™
can speak all data sent to the terminal
[illegible]. Information
randomly accessed from a data base can
be verbalized. Using the Type-'N-Talk™
data switching capability, the unit can be
"de-selected" while data is sent to the
terminal and vice-versa, permitting speech
and visual data to be independently sent
on a single data channel.

Type-'N-Talk™ can be interfaced in several
ways using special control characters.
Connect it directly to a computer's serial
interface. Then a terminal, line printer, or
additional Type-'N-Talk™ units can be
connected to the first Type-'N-Talk™,
eliminating the need for additional
RS-232C ports on your computer.

Using unit assignment codes, multiple
Type-'N-Talk™ units can be daisy-chained.
Unit addressing codes allow independent
control of Type-'N-Talk™ units and
your printer.

Look what you get for $375.00.
TYPE-'N-TALK™ comes with:

•  Text-to-speech algorithm

•  A one-watt audio amplifier

•  SC-01 speech synthesizer chip (data
rate: 70 to 100 bits per second)

•  750 character buffer

•  Data switching capability

•  Selectable data modes for versatile
interfacing

•  Baud rate (75-9600)

•  Data echo of ASCII characters

•  Phoneme access modes

•  RS 232C interface

•  Complete programming and installation
instructions

The Votrax Type-'N-Talk™ is one of the
easiest-to-program speech synthesizers on
the market. It uses the least amount of
memory and it gives you the most flexible
vocabulary available anywhere.

Call the toll-free number below to
order or request additional information.
MasterCard or Visa accepted. Charge to your
credit card or send a check for $375.00
plus $4.00 delivery. Add 4% sales
tax in Michigan.

1-800-521- 135a

Votrax

(313) 9S8-0341


Klatt's synthesizer provides minute control over the
main parameters underlying human speech. Consequently,
anything can potentially be synthesized, given the correct
parameter input. This type of synthesizer will certainly
see wide commercial application in the future. Figure 4
gives a flow diagram of the Klatt synthesizer.

The synthesizer is a cascade/parallel formant
synthesizer as shown in the top schematic of Figure 5. The
two main components of the synthesizer are the cascade
portion and the parallel portion; this amounts to a
combination of the two common types of experimental
synthesizers widely seen in the literature.

Parallel synthesizers, which are essentially formant
resonators connected in parallel that simulate the transfer
function of the vocal tract, are of the type shown in the
bottom schematic of Figure 5. Each formant resonator is
preceded by an amplitude control that determines the
relative amplitude of a formant in the output spectrum for
both voiced and voiceless speech sounds. The cascade
configuration is noted by Dr. Klatt to have the advantage of
having the relative amplitudes of formants automatically
computed without the need for individual amplitude controls
for each formant. The disadvantage of cascade synthesizers
is that one still needs a parallel formant configuration for
the generation of fricatives and plosives. This is due to
the fact that the vocal tract transfer function cannot be
modeled adequately by five cascade resonators when the
source sound is above the larynx. So, overall, cascade
synthesizers tend to be relatively more complex. A second
advantage of the cascade configuration is that it is a more
accurate model of the vocal tract transfer function during
the production of non-nasal voiced sounds. Also, it is
difficult to match the transfer function of certain vowels
using a parallel formant synthesizer.


Figure 4: Flow diagram of Klatt's synthesizer.

Figure 5: Configurations for cascade/parallel formant
synthesizers and for special-purpose all-parallel
formant synthesizers.

Klatt's synthesizer uses two sound sources, a voicing
source for periodic sounds and a noise source for
nonperiodic or turbulent sounds (such as for fricatives).
The Klatt synthesizer has a sampling rate of 10,000 samples
per second, as speech does not have much energy above
5000 Hz, and low-pass filtered speech sounds perfectly
natural.

Klatt's synthesizer has a set of 39 control parameters
which are used for synthesis; as many as 20 of these
parameters may be varied for English utterances. The Klatt
synthesizer basically uses the parameters for input, with
each formant realized as a digital resonator. Tables 45 and
46 list variable parameters and sample parameters.
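
The digital resonator mentioned above can be sketched in a few lines of Python. The sketch uses the standard second-order difference equation y(n) = A x(n) + B y(n-1) + C y(n-2), with the coefficients derived from a formant frequency and bandwidth; the function names are illustrative, the impulse-train excitation is a crude stand-in for a real voicing source, and the formant values are the "typical" frequencies and bandwidths listed in Table 45. It is a minimal illustration under those assumptions, not a reproduction of Klatt's program.

```python
import math


def resonator_coefficients(freq_hz, bw_hz, sample_rate=10000):
    """Coefficients A, B, C of a second-order digital resonator."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalizes the gain to 1 at zero frequency
    return a, b, c


def resonate(samples, freq_hz, bw_hz, sample_rate=10000):
    """Filter samples through one resonator: y = A*x + B*y1 + C*y2."""
    a, b, c = resonator_coefficients(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(samples, formants, sample_rate=10000):
    """Pass samples through a chain of resonators, one per formant."""
    for freq_hz, bw_hz in formants:
        samples = resonate(samples, freq_hz, bw_hz, sample_rate)
    return samples


def impulse_train(f0_hz, n_samples, sample_rate=10000):
    """Crude periodic excitation (an assumption, not Klatt's source model)."""
    period = max(1, round(sample_rate / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


# Typical (frequency, bandwidth) pairs in Hz for F1-F5, from Table 45.
TYPICAL_FORMANTS = [(450, 50), (1450, 70), (2450, 110), (3300, 250), (3750, 200)]

# One tenth of a second of a neutral, vowel-like waveform at 100 Hz pitch.
vowel_like = cascade(impulse_train(100, 1000), TYPICAL_FORMANTS)
```

Because the cascade needs only a frequency and a bandwidth per formant, the relative formant amplitudes come out of the filter chain automatically, which is the advantage of the cascade configuration described earlier.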

Spectrograms  are  used  by  Klatt  as  a  model  to  determine 
the  general  acoustic  characteristics  of  the  utterance  to  be 
synthesized.  Spectrograms  of  natural  speech,  compared  with 
synthesized  speech,  show  just  how  well  the  Klatt  synthesizer 
models  human  speech.  Figure  6  displays  a  comparison  between 
a natural utterance and a synthesized utterance.


TABLE 45

Klatt Synthesizer Variable Parameters


List of control parameters for the software formant synthesizer.
The second column indicates whether the parameter is normally constant (C)
or variable (V) during the synthesis of English sentences. Also listed are
the permitted range of values for each parameter, and a typical constant
value.


 N  V/C  Sym  Name                                     Min      Max      Typ

 1   V   AV   Amplitude of voicing (dB)                  0       80        0
 2   V   AF   Amplitude of frication (dB)                0       80        0
 3   V   AH   Amplitude of aspiration (dB)               0       80        0
 4   V   AVS  Amplitude of sinusoidal voicing (dB)       0       80        0
 5   V   F0   Fundamental freq. of voicing (Hz)          0      500        0
 6   V   F1   First formant frequency (Hz)             150      900      450
 7   V   F2   Second formant frequency (Hz)            500     2500     1450
 8   V   F3   Third formant frequency (Hz)            1300     3500     2450
 9   V   F4   Fourth formant frequency (Hz)           2500     4500     3300
10   V   FNZ  Nasal zero frequency (Hz)                200      700      250
11   C   AN   Nasal formant amplitude (dB)               0       80        0
12   C   A1   First formant amplitude (dB)               0       80        0
13   V   A2   Second formant amplitude (dB)              0       80        0
14   V   A3   Third formant amplitude (dB)               0       80        0
15   V   A4   Fourth formant amplitude (dB)              0       80        0
16   V   A5   Fifth formant amplitude (dB)               0       80        0
17   V   A6   Sixth formant amplitude (dB)               0       80        0
18   V   AB   Bypass path amplitude (dB)                 0       80        0
19   V   B1   First formant bandwidth (Hz)              40      500       50
20   V   B2   Second formant bandwidth (Hz)             40      500       70
21   V   B3   Third formant bandwidth (Hz)              40      500      110
22   C   SW   Cascade/parallel switch              0(CASC)  1(PARA)        0
23   C   FGP  Glottal resonator 1 frequency (Hz)         0      600        0
24   C   BGP  Glottal resonator 1 bandwidth            100     2000      100
25   C   FGZ  Glottal zero frequency (Hz)                0     5000     1500
26   C   BGZ  Glottal zero bandwidth (Hz)              100     9000     6000
27   C   B4   Fourth formant bandwidth (Hz)            100      500      250
28   V   F5   Fifth formant frequency (Hz)            3500     4900     3750
29   C   B5   Fifth formant bandwidth (Hz)             150      700      200
30   C   F6   Sixth formant frequency (Hz)            4000     4999     4900
31   C   B6   Sixth formant bandwidth (Hz)             200     2000     1000
32   C   FNP  Nasal pole frequency (Hz)                200      500      250
33   C   BNP  Nasal pole bandwidth (Hz)                 50      500      100
34   C   BNZ  Nasal zero bandwidth (Hz)                 50      500      100
35   C   BGS  Glottal resonator 2 bandwidth            100     1000      200
36   C   SR   Sampling rate                           5000    20000    10000
37   C   NWS  No. of waveform samples per chunk          1      200       50
38   C   G0   Overall gain control (dB)                  0       80       47
39   C   NFC  Number of cascaded formants                4        6        5


TABLE  46 


Klatt  Synthesizer  Sample  Parameters 


Klatt cascade/parallel formant synthesizer

THE FOLLOWING TABLE REPRESENTS THE CONFIGURATION FOR THE CURRENT PARAMETER FILE.


NUM  PARM  V/C  VALUE    NUM  PARM  V/C  VALUE    NUM  PARM  V/C  VALUE

  1  AV     1       0     14  A3     1      SO     27  B4     0    3400
  2  AF     1       0     15  A4     1      SO     28  F5     0    3700
  3  AH     1       0     16  A5     1       0     29  B5     0     500
  4  AVS    1       0     17  A6     1       0     30


Broadband spectrograms of a natural and a synthetic word,
"string," spoken by a female talker, are compared
(horizontal axis: time).


Figure 6: Natural utterance compared to Klatt's
synthesized utterance.

Klatt states the usefulness of the linear prediction
spectrum. To obtain this spectrum, a linear prediction
analysis precedes a discrete Fourier transform. The
autocorrelation algorithm (Makhoul, 1975), using 14 poles, is
applied.
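
As a rough sketch of the procedure just described, the Python below computes autocorrelation values for a frame of speech samples, solves for the predictor coefficients with the Levinson-Durbin recursion (the usual way of carrying out the autocorrelation method), and evaluates the resulting all-pole spectrum by taking a Fourier transform of the coefficient sequence. The function names, the number of spectrum points, and the absence of windowing or pre-emphasis are simplifying assumptions; only the choice of 14 poles comes from the text.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order=14):
    """Solve the autocorrelation normal equations.

    Returns (a, error), where a[0] == 1.0 and the prediction polynomial is
    A(z) = sum(a[k] * z**-k), and error is the residual energy.
    """
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break  # degenerate frame; stop rather than divide by zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= (1.0 - k * k)
    return a, error


def lpc_spectrum_db(a, error, n_points=256):
    """All-pole spectrum in dB at n_points frequencies from 0 to Nyquist."""
    gain = math.sqrt(error)
    spectrum = []
    for m in range(n_points):
        w = math.pi * m / n_points
        denom = sum(coef * cmath.exp(-1j * w * k) for k, coef in enumerate(a))
        spectrum.append(20.0 * math.log10(gain / abs(denom)))
    return spectrum
```

For a frame sampled at 10,000 samples per second, `levinson_durbin(autocorrelation(frame, 14))` would give the 14-pole model, and the peaks of `lpc_spectrum_db` would indicate the formant locations.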

It is expected that Klatt's current rule synthesizer
will be fully incorporated into future commercial products,
with appropriate control software. A synthesizer
incorporating Klatt's latest synthesis would probably
include storage capabilities for control parameters.
Finally, it is expected that once Klatt's current synthesis
program is incorporated into a commercial synthesizer, it
should provide very serious competition for currently
available synthesizers, because the output is often not
easily distinguishable from live speech.


CHAPTER 6


CONCLUSIONS FOR USE OF SPEECH RECOGNITION
AND SPEECH SYNTHESIS TECHNIQUES

This  chapter  examines  and  compares  different  speech 
recognition  and  speech  synthesis  technologies  as  they  relate 
to Coast Guard operational and technical requirements. It
also  discusses  problems  involved  in  developmental  efforts 
for  speech  recognition  and  synthesis.  Finally,  it  proposes  a 
future  plan  for  using  speech  synthesis  for  broadcasting 
Coast  Guard  weather  reports. 

6.1  SPEECH  RECOGNITION  TECHNOLOGY 

Coast Guard operational and technical requirements lead
us to the conclusion that the Coast Guard requires a totally
speaker-independent speech recognition system capable of
spotting keywords in connected speech. In particular, the
Coast Guard has considered speech recognition as a potential
means to back up watch standers in guarding distress
frequencies. Application areas mentioned by the Coast Guard
included: Communications Stations, Radio Stations, Group
Stations, Search and Rescue Stations, and Coast Guard
Cutters. Coverage is for 2182 kHz MF radiotelephone, and
156.8 MHz radiotelephone (Channel 16). We pointed out that
our analysis of selected Coast Guard radio transmissions
showed that the Coast Guard needs a recognition system
capable of handling transmissions with a relatively low
signal-to-noise ratio (mean S/N ratio was 23 dB, with a
standard deviation of 5 dB), and with cutoff frequencies
ranging from approximately 300-4000 Hz. Additional
technical requirements make its speech recognition
requirements even more stringent. For example, we noted
that Coast Guard requirements regarding keyword spotting
indicate that a recognizer should be able to handle
connected speech input with widely differing emotional
states, diverse accents, and substantial nonperiodic
background noise input.

As noted previously, SCRL was able to compile detailed
information  regarding  speech  recognition  products  from  nine 
major  manufacturers: 

1)  Centigram 

2)  Heuristics 

3)  Interstate  Electronics 

4)  Nippon  Electric  Company 

5)  Scott  Instruments 

6)  Threshold  Technology 

7)  Verbex 

8)  Voicetek

9)  Votan


As  mentioned  earlier,  there  are  no  currently  available 
recognizers  which  can  handle  connected  speech  in  a 
completely  speaker-independent  manner.  We  did  point  out 
that  three  manufacturers  market  recognizers  capable  of 
handling  connected  speech.  However,  of  these  three 
manufacturers  only  one  markets  a  speaker-independent 
recognizer  (capable  of  recognizing  digits  plus  50  optional 
words). Yet this particular recognizer will not handle
connected speech input. From our discussions with
manufacturers and review of ongoing work related to speech
recognition technology, we note that there is a large effort
being devoted to the task of developing speaker-independent
recognition systems capable of handling connected speech
input.

In  line  with  a  general  price  reduction  in  speech 
recognition  systems  due  to  improved  technology  and 
manufacturing  techniques,  we  believe  that  the  price  of 
future  speaker-independent  recognizers  capable  of  handling 
continuous  speech  input  will  be  lower  than  might  first  be 
imagined, probably on the order of $20K, depending upon the
size  of  the  recognition  vocabulary.  It  can  also  be  pointed 
out  that  first-generation  speaker-independent,  continuous 
speech  input  voice  recognizers  will  handle  the  digits 
primarily,  plus  several  control  words.  This  configuration 
appears  to  have  wide  marketing  possibilities  related  to 
business  usage  over  the  telephone. 

It is interesting that three recognizers which would be
closest to meeting Coast Guard requirements involve a
generally similar approach to voice recognition. Each unit
digitally encodes voice input samples for comparison with
stored "templates"; the approach is to correlate stored
filter coefficients or other stored "template" data with
incoming speech samples. Given an appropriately high
correlation, a match occurs, the word is recognized, and
appropriate ASCII messages are output from the recognition
unit. This ASCII output can be used to define a variety of
instructions as required. Only one of the recognizers does
not require "training", where specific speakers follow a
predefined sequence for encoding their recognition
vocabulary "templates."
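
The template-matching approach described above can be
illustrated with a minimal sketch. It assumes each word is
reduced to a short, fixed-length feature vector; the
template values, the 0.9 threshold, and the function names
are hypothetical and are not taken from any of the
recognizers surveyed.

```python
import math

def correlation(a, b):
    """Pearson correlation between two equal-length feature vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0

# Hypothetical stored templates (illustrative values only).
TEMPLATES = {
    "mayday": [0.9, 0.1, 0.8, 0.2, 0.7],
    "sinking": [0.1, 0.9, 0.2, 0.8, 0.3],
}

def recognize(features, templates=TEMPLATES, threshold=0.9):
    """Return an ASCII message for the best-correlated template,
    or None when no correlation reaches the threshold."""
    best_word, best_score = None, -1.0
    for word, template in templates.items():
        score = correlation(features, template)
        if score > best_score:
            best_word, best_score = word, score
    if best_score >= threshold:
        return best_word.upper().encode("ascii")
    return None
```

An input close to the "mayday" template yields the ASCII
message for that word, while an input that correlates
poorly with every template yields no output.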

In terms of meeting specific Coast Guard operational
and technical requirements regarding spotting of keywords in
incoming distress signals ("mayday", "sinking", etc.), we
note that the Verbex 1800 comes closest. Again, we note
that at least one device might be close. This recognizer can
handle up to 50 words plus the digits, in a
speaker-independent mode over the telephone (and with
accompanying adverse noise conditions) with high accuracy,
but it cannot handle connected speech input. It should be
pointed out, however, that there may be a possibility of
using a speaker-independent, isolated word recognizer for
meeting Coast Guard speech recognition requirements.

Although all sample radio transmissions which were
acoustically evaluated by SCRL involved connected speech, we
do have the impression that mariners generally pronounce
distress signals slowly, and repeat keywords such as
"mayday". If this is generally the case with incoming Coast
Guard distress calls, an isolated word recognizer could be
expected to successfully recognize a high percentage of
keywords in Coast Guard distress signals. We should point
out that there remain several uncertainties regarding such
an approach. One important consideration would be how false
recognitions might be generated by connected speech input
surrounding the "isolated" keywords to be spotted.

There  are  at  least  two  further  points  to  note  regarding 
Coast  Guard  station  automation  plans  and  the  possibility  of 
using  speech  recognition  to  spot  keywords  in  incoming 
distress  signals.  First,  there  is  the  basic  consideration 
as  to  what  level  of  speech  recognition  product  might  most 
easily  be  integrated  into  Coast  Guard  station  automation 
plans. Second, there is the consideration as to the relative cost
of  different  levels  of  speech  recognition  devices. 

As noted previously, there are several levels of speech
recognition products widely seen in today's commercial
market, including board-level recognition products and
stand-alone devices. Board-level recognition products are
most suitable for installations which are already equipped
with a host computer capable of accepting the common RS232C
interface. Stand-alone recognizers are most suited to
installations which do not have a host computer with enough
storage to handle speech recognition.

It  should  be  pointed  out  that  stand-alone  speech 
recognition  units  tend  to  be  relatively  expensive,  since 
they are generally built around an existing minicomputer.
Wherever this minicomputer can be used for tasks in addition
to speech recognition (such as word processing, data
storage, etc.), it makes a more cost-efficient package than
if it operates only as a speech recognition host.

6.2  SPEECH  SYNTHESIS  TECHNOLOGY 

As pointed out previously, SCRL is very optimistic that
presently available speech synthesis technology is
fully capable of meeting Coast Guard operational and
technical requirements for synthesis of weather reports,
notices to mariners, hydrographic information, and other
desired broadcasts. Application areas include Communication
Stations, Radio Stations, Group Stations, and Search and
Rescue Stations. We recommend that the Coast Guard consider
using both an analysis synthesizer to obtain natural
sounding speech and also an ASCII-prompted (for incoming
weather reports via teletype) rule-type "text-to-speech"
speech synthesizer for future efficiency and extended
ability.

It should be pointed out that the Coast Guard requires
a speech synthesizer with an essentially unlimited
vocabulary (for names of ships, storms, etc.). Coast Guard
speech synthesis also needs high-quality audio output.
Synthesized Coast Guard broadcasts should have good
prosodics, such as realistic sentence intonation, and a
variety of voices. A Coast Guard consideration has been
that advanced technologies, such as speech synthesis, offer
the possibility of conserving manpower, and thus saving
operating funds.


6.3 DEVELOPMENTAL EFFORTS FOR SPEECH RECOGNITION TECHNOLOGY

As  this  report  has  mentioned,  there  are  no  currently 
available  speech  recognizers  which  can  handle  connected 
speech  in  a  speaker-independent  mode.  A  variety  of 
manufacturers  are  now  developing  this  type  of  speech 
recognition  system,  for  it  has  such  a  diversity  of  potential 
markets. 

Manufacturers  point  out  that  the  development  of  a 
completely  speaker-independent  recognizer  that  would  handle 
connected  speech  involves  several  technical  problems  which 
have  not  yet  been  fully  solved.  These  problems  include: 

1)  Need  for  segmentation  programs  which  correctly 
determine  the  words  or  phrases  to  be  matched  with 
stored  templates. 

2) Need for better time warping algorithms to more
accurately match reference templates with input data.

3) Need for better algorithms for relating essentially
unique or idiosyncratic acoustic manifestations to
common reference templates, including both phonetic and
prosodic phenomena.
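
As an illustration of the time warping problem in item 2,
the following is a minimal sketch of dynamic time warping,
one common textbook formulation. It assumes each utterance
is reduced to a one-dimensional sequence of feature values;
it is not the algorithm of any particular manufacturer.

```python
def dtw_distance(reference, observed):
    """Dynamic time warping distance between two 1-D feature sequences.

    Lower values mean the observed sequence matches the reference
    template more closely after stretching or compressing it in time.
    """
    n, m = len(reference), len(observed)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning the first i reference
    # frames with the first j observed frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - observed[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # reference advances
                                     cost[i][j - 1],      # observed advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

A word spoken at half speed (each frame repeated) still
matches its template with zero distance, whereas a
different frame pattern does not.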

In terms of Coast Guard developmental efforts in the
area of speech recognition, this report suggests that the
above problems are exceedingly difficult and commercial
manufacturers are now working on them. We feel that
developmental efforts the Coast Guard might make in this
area would closely parallel those of commercial
manufacturers of recognizers and would not be cost
effective.

Speaker-independent recognition of connected speech
should not be far off for current manufacturers. We
have already pointed out, for example, that one recognizer
will handle isolated digits and control words, plus up to 50
selected optional vocabulary items, in a completely
speaker-independent mode over the telephone. Several
manufacturers of recognizers already market recognizers
which will handle connected speech (approximately 180 words
per minute). Speech recognition technology is very
dependent on the LSI chip industry. As prices for LSI chips
continue to decrease, we can expect to see improved
performance from commercial speech recognition devices. With
this in mind, one can readily envision a completely
speaker-independent recognizer which will handle connected
speech in the not-too-distant future. For this reason, we
suggest that the Coast Guard need not sponsor the
development of a special purpose speaker-independent
connected speech recognizer for its applications.

One area where the Coast Guard should concentrate its
efforts concerns the overall planning strategy for station
automation requirements. Several levels of speech
recognition products currently exist, for example, board
products and complete stand-alone products. To easily
integrate speech recognition technology in automation
planning, the Coast Guard should consider what kind of an
approach it will be taking with regard to computer selection
and implementation. A key point to consider concerns the
question of how much computing capability is required to
meet Coast Guard requirements for automation of its
facilities.

6.4  DEVELOPMENTAL  EFFORTS  FOR  SPEECH  SYNTHESIS  TECHNOLOGY 

As  we  have  already  stated,  speech  synthesis  technology 
currently  exists  which  appears  capable  of  meeting  Coast 
Guard  operational  and  technical  requirements  related  to  its 
broadcast requirements. Both an analysis synthesizer and a
rule-based text-to-speech synthesizer would provide a
convenient means of preparing Coast Guard weather reports,
hydrographic information, notices to mariners, safety
messages, and other desired broadcasts. Consequently, there
is no real need for the Coast Guard to initiate
developmental efforts regarding speech synthesis systems for
meeting its broadcast requirements. We recommend
applications studies instead.

As  with  speech  recognition  technology,  speech  synthesis 
technology  is  heavily  tied  to  the  economics  of  the  LSI  chip 
industry.  Recent  years  have  seen  a  real  decline  in  prices 
for  LSI  chips  and  an  increase  in  their  capabilities.  This 
trend  is  expected  to  continue,  so  that  speech  synthesis 
technology  will  become  even  more  attractive,  not  only  in 
terms  of  performance,  but  in  terms  of  price  as  well. 

We  do  suggest  that  the  Coast  Guard  consider  speech 
synthesis  technology  in  the  framework  of  its  overall 
automation  planning  requirements,  so  that  this  technology 
can  easily  integrate  with  the  overall  Coast  Guard  computing 
requirements  for  station  automation  planning. 

6.5  SPEECH  RECOGNITION  COST  EFFECTIVENESS 

Speech  recognition  technology  has  generally  been  most 
cost effective: 1) in terms of the manner in which it
increases the efficiency of human/equipment or man/machine
interactions,  2)  through  its  ability  to  automate  procedures 
that  previously  required  human  operators,  and  3)  in  terms  of 
its  overall  efficiency  in  recording  data  from  human 
operators. 

1) Speech recognition technology has been cost
effective for humans who are employed in
situations where their hands are occupied,
but they must still record various types of
data. For example, speech recognition has
been used to facilitate data entry from
humans who are performing various types of
inspection procedures which require the use
of both hands, such as inventory accounting
and cartographic analysis.

2) Speech recognition technology has, in various
instances, replaced human operators
altogether where procedures are to be
initiated upon simple verbal commands. An
example of this would be situations where
companies receive incoming phone requests for
certain types of information, as with stock
brokerage firms that typically receive
numerous requests for quotations on
securities. In such situations, speech
recognition technology has proven its ability
to eliminate human operators, and to provide
required information over phone lines based
upon user prompting via alphanumeric input.
To this point, only one manufacturer has
provided customers with a completely
speaker-independent speech recognition system
capable of providing this kind of telephone
service.

3)  Voice  input  of  data  is  a  highly  effective 
means  of  obtaining  data  from  humans,  as 
opposed  to  data  entry  via  keyboard  which  is  a 
much  slower  process.  Speech  recognition 
technology  has  also  proven  its  ability  to 
offer  a  very  efficient  means  of  entering 
commands  to  computers.  A  basic  illustration 
of  the  effectiveness  of  speech  recognition 
technology  involves  the  fact  that  humans, 
too,  can  more  efficiently  respond  to  voice 
commands, as opposed to visual or other forms
of  prompting. 

As viewed by the Coast Guard in its Statement of Work
for this project, speech recognition technology appears to
be most applicable as a means of assisting human operators
who monitor distress frequencies. In this sense, speech
recognition technology would be intended not so much to
replace all human operators, but to provide low-cost
assistance in guarding distress frequencies. At this level
speech recognition technology would not actually reduce
front-line operating expenses, but would instead be designed
to increase the Coast Guard's overall efficiency in
monitoring distress signals.

Speech recognition could offer cost savings in backing up
operators monitoring distress signals. Wherever personnel
are now used to back up these operators, speech recognition
systems could potentially relieve those individuals and
free them for other duties. Multi-channel recognizers
already exist and future generations of recognizers will
likely continue this capability. Given this assumption, a
single recognition unit could be used to back up several
operators through its capability to monitor keywords on
several channels simultaneously.

Concluding,  speech  recognizers  do  offer  definite  cost 
savings  advantages  in  numerous  situations.  However,  in 
terms  of  the  Coast  Guard’s  application  for  keyword  spotting 
as  a  means  of  backing  up  human  coverage  of  distress 
frequencies, its main advantage lies not in terms of its
cost  effectiveness,  but  in  terms  of  its  overall  potential  to 
provide  low-cost  effective  back-up  to  the  Coast  Guard’s 
monitoring  of  distress  signals. 

6.6  SPEECH  SYNTHESIS  COST  EFFECTIVENESS 

This  report  has  already  noted  that  speech  synthesis  not 
only  saves  on  manpower  required  for  meeting  broadcast 
requirements,  but  is  also  an  ideal  technology  for  achieving 
a  more  fully  automated  broadcast  facility.  As  an  example  of 
this  sort  of  automation,  again  refer  to  the  Coast  Guard 
weather reports which are received via teletype. By using
ASCII prompting as input to a voice synthesizer, weather
reports could be prepared and stored for broadcast by
computer. Weather reports would then consist of stored
ASCII input for producing the required broadcasts on a
speech synthesizer. This method would entirely eliminate
soundproof booths, storage on analog tape, variation in
microphone-to-mouth distance, and much of the time it now
takes  operators  to  prepare  weather  reports.  Since  speech 
synthesis  technology  is  also  an  entirely  solid-state 
technology,  it  should  also  increase  the  reliability  of  the 
Coast  Guard’s  broadcast  system. 

The  overall  degree  of  cost  savings  resulting  from  the 
use  of  speech  synthesis  technology  will  vary,  depending  upon 
the  manner  in  which  it  is  used.  For  example,  many 
applications  have  used  speech  synthesis  to  aid  in  setting  up 
efficient  man/machine  interactions.  Other  applications  have 
used  speech  synthesis  technology  to  entirely  replace  human 
operators. 


6.7 EFFECTIVE USE OF SPEECH SYNTHESIS IN COAST GUARD
BROADCASTS

The first and most effective use of speech synthesis
would be to broadcast the Coast Guard weather reports. It
is suggested that the Coast Guard might collaborate with the
National Weather Service (NWS) in a joint effort to automate
this service. This would be in line with a Task Report
prepared by the NWS on creating a sample vocabulary of words
and/or phrases suitable for use with automated speech
generation systems. The NWS previously had evaluated systems
capable of providing an automated readout of computer
generated weather reports, particularly those used on NOAA
Weather Radio. The NWS* had also tested the public's
reaction to computer generated forecasts which were
broadcast over WSFO Washington's NWR for a 10-day period.
Public reaction to these tests was favorable and indicated
that the public would accept a broadcast by a speech
synthesizer that basically filled in the blanks of a
forecast using pre-recorded phrases selected from a
standardized list of permissible expressions.
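
The "fill in the blanks" approach can be sketched as
follows. The forecast frame, slot names, and permitted
phrases are hypothetical and chosen only for illustration;
they are not drawn from the NWS vocabulary. The output is
the ordered list of pre-recorded phrases to be played.

```python
# Hypothetical forecast frame: fixed phrases plus named blanks.
FRAME = ["today", "{sky}", "winds", "{wind}", "at", "{speed}", "knots"]

# Hypothetical standardized list of permissible expressions per blank.
PERMITTED = {
    "sky": ["clear", "partly cloudy", "cloudy", "rain"],
    "wind": ["north", "south", "east", "west"],
    "speed": ["5", "10", "15", "20", "25"],
}

def build_forecast(values):
    """Return the sequence of pre-recorded phrases for one forecast.

    Raises ValueError if a blank is filled with an expression that is
    not on the standardized list, since no recording would exist for it.
    """
    phrases = []
    for part in FRAME:
        if part.startswith("{") and part.endswith("}"):
            slot = part[1:-1]
            choice = values[slot]
            if choice not in PERMITTED[slot]:
                raise ValueError("no recorded phrase for %r" % choice)
            phrases.append(choice)
        else:
            phrases.append(part)
    return phrases
```

Restricting each blank to a fixed list is what makes
pre-recorded speech workable, and it is also the limitation
that an unlimited-vocabulary rule synthesizer removes.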

With the Coast Guard's operational and technical
requirements in mind, we suggest that initially analysis
synthesis be used for broadcasts to ensure the most natural
sounding speech. Introducing synthetic speech requires
adjustment by the listeners of the broadcasts, so it is
essential that the messages be transmitted with the
highest quality speech possible.

* An unpublished working document.

We feel that ultimately rule synthesis has very
definite advantages for its potential use. First, rule
synthesizers allow an essentially unlimited vocabulary.
Several companies have recently marketed "text-to-speech"
synthesizers, where users merely type in the desired
phonemes and the unit outputs the desired vocabulary items.
We have already mentioned that the Coast Guard receives
incoming weather reports via teletype. Since most types of
synthesizers accept commands via ASCII, we suggest that the
Coast Guard connect a "text-to-speech" rule synthesizer to a
teletype, so that incoming weather reports could be prepared
for broadcast, as they are received, using synthesis.

Assuming that Coast Guard communications facilities are
to be automated, a computer could be used to issue
synthesized broadcasts to mariners at specified intervals.
This approach eliminates soundproof booths and speaker
inconsistencies, since synthesized speech output is uniform
and free from background noise and variation in
microphone-to-mouth distance. Speech synthesis technology
offers the advantage of all solid-state electronics, as
opposed to analog recording techniques now used by the Coast
Guard. Such electronics have proven to be highly reliable
as they contain no moving parts.

Rule-type text-to-speech synthesizers require less
programming support than do other types of speech
synthesizers; the user merely types in the required text at
the keyboard of a CRT terminal, and the synthesizer produces
the desired output. Rule synthesizers require little
linguistic sophistication on the part of the user. The
price of this type of synthesizer has seen a downward trend
recently, and should continue to decrease over the next
several years. It is anticipated that the quality of speech
will improve over time, so that it will be competitive with
the very natural sounding analysis synthesis.

Later, when weather reports have been effectively
broadcast using speech synthesis, other routine messages
could also be automatically generated. Just as there is a
suggestion of initially using analysis synthesis for
maximally natural sounding speech in weather reports before
introducing the extendable rule synthesis, a similar pattern
of broadcasting could be followed for other messages. The
listeners could adjust to more and more messages being
broadcast synthetically, if the quality were as humanlike as
possible. Then, if unlimited vocabularies were desired,
rule synthesis, which is more machine-like but capable of
generating any and all utterances, including new words and
proper names, could be used. By carefully monitoring
listener response, the Coast Guard could determine the
number and types of messages that should be generated
synthetically. Also, records could be kept on the
effectiveness of using speech synthesis for Coast Guard
broadcasts. It is predicted that the synthetic message will
be more and more natural sounding as the technology
continues to make advances and that public acceptance of
computer generated broadcasts will be regularly increasing.

APPENDIX A


GLOSSARY OF TERMS USED IN THIS REPORT

This glossary is intended to present to interested
readers definitions of various terms used in this
report. These definitions are designed to facilitate
comprehension of this report by those whose technical
expertise may lie outside the area of voice technology.


1)  accuracy  rate  -  performance  measures  given  by 
manufacturers  of  their  recognition  systems. 
These  figures  must  be  regarded  with  some  care.  The 
accuracy  rates  are  based  on  different  types  of 
vocabularies,  since  there  is  not  any  widely 
accepted  vocabulary  used  to  test  recognizers.  Each 
manufacturer  is  generally  free  to  use  its  own 
chosen  vocabulary.  Naturally,  some  vocabularies 
are  easier  than  others  for  recognition  success. 
For example, "right" vs. "left" and "up" vs.
"down" are easier to distinguish acoustically from
one another than are "right" vs. "ripe" and "down"
vs. "done". Typically, manufacturers choose
sample recognition vocabularies that are maximally
effective for their own recognizers. Thus, we
notice that rarely will any manufacturer of speech
recognizers claim an accuracy rate of less than
95-99%. Consequently, it is extremely important to test
recognizers with the intended vocabulary for the
user who will purchase the system in order to
obtain a better idea as to what the accuracy rate
of the recognizer will actually be in the field.

2) adjustable reject setting - variation of the
acceptance threshold for vocabulary items. For
example, if the adjustable reject setting is too
high, the accuracy rate will be reduced, since
only input vocabulary items with a very high
degree of statistical correspondence with stored
templates will be recognized. On the other hand,
if  the  reject  level  is  set  too  low,  false 
recognitions  can  be  generated,  since  there  is  a 
relatively  lower  criterion  for  matching  input 
vocabulary  with  stored  templates.  Part  of  the 
solution  to  this  problem  involves  the  careful 
choice  of  vocabulary  items,  so  that  they  are 
maximally  distinct  acoustically. 
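
The trade-off described in this entry can be shown with a
small sketch. The match scores below are invented for
illustration; each trial pairs a score with whether the
input really was a vocabulary item.

```python
# Hypothetical trials: (match score, input was a real vocabulary item).
TRIALS = [
    (0.95, True), (0.88, True), (0.72, True),
    (0.80, False), (0.55, False), (0.40, False),
]

def evaluate(trials, reject_level):
    """Count errors at a given reject level.

    Returns (missed, false_hits):
      missed     - vocabulary items rejected because they scored too low
      false_hits - non-vocabulary input accepted as a recognition
    """
    missed = sum(1 for score, in_vocab in trials
                 if in_vocab and score < reject_level)
    false_hits = sum(1 for score, in_vocab in trials
                     if not in_vocab and score >= reject_level)
    return missed, false_hits
```

With these invented scores, a high reject level of 0.9
misses two genuine words and accepts no false ones, while a
low level of 0.5 misses none but accepts two false
recognitions.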

3) analysis synthesis - a type of speech generation or
speech synthesis that is based upon an acoustic
analysis of real human speech. Basically, an
analysis of human speech can be used to define the
gross values for a digital filter simulating human
speech. This type of synthesis is distinguished
from rule synthesis, which does not base output
speech upon real human speech, but on basic
combinations of acoustic parameters which produce
human-like speech. A common type of speech
analysis used for analysis synthesis is linear
prediction coding, known as LPC synthesis.

4)  bit  rate  -  the  number  of  bits  that  are  used  to 
synthesize  digitally  a  speech  utterance.  The 
lower  the  bit  rate,  the  less  information  that  has 
to  be  stored  in  computer  memory.  This  is  an 
important  consideration  for  cost  effective  speech 
synthesizers.  The  higher  the  bit  rate,  the  more 
information  that  contributes  to  natural  sounding 
speech  synthesis. 
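
The storage consequence of bit rate can be worked through
with simple arithmetic. This sketch assumes bit rate is
expressed in bits per second of speech; the two rates and
the 60-second message length are illustrative values, not
figures from this report.

```python
def storage_bytes(bit_rate_bps, duration_seconds):
    """Bytes needed to store speech encoded at a constant bit rate,
    assuming 8 bits per byte."""
    return bit_rate_bps * duration_seconds // 8

LOW_RATE_BPS = 2400    # illustrative low-rate coding (assumed value)
HIGH_RATE_BPS = 64000  # illustrative high-rate coding (assumed value)
MESSAGE_SECONDS = 60   # illustrative message length (assumed value)

low = storage_bytes(LOW_RATE_BPS, MESSAGE_SECONDS)    # 18000 bytes
high = storage_bytes(HIGH_RATE_BPS, MESSAGE_SECONDS)  # 480000 bytes
```

Under these assumptions the higher rate needs roughly 27
times the memory for the same message, which is the cost
side of the trade-off against more natural sounding speech.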

5) byte - one unit of information in computing. On IBM
systems, there are 8 bits per byte. On ASCII
terminals, there are 7 bits per byte.

6) coarticulations - the changes in the acoustic
parameters of speech which occur between adjacent
vowel and consonant sounds. When individual
speech sounds are joined together to form words
and sentences, certain of their acoustic
parameters are affected by the neighboring sounds.

7) codec - a device which stores speech data which have
been digitally encoded.


8) computer storage - both "real storage" which is the
amount of storage required in the central
processing unit of the computer and "virtual
storage" which is storage within the computer, but
not part of the central processing unit.

9) connected or continuous speech recognizer - a
recognizer that is able to correctly identify
input speech which consists of concatenated, or
connected, strings of words. This is a much more
difficult task than recognition by isolated word
recognizers, for the words flow continuously
together and do not have boundaries of silence
between them.


10) EPROM - erasable, programmable, read only memory;
a category of LSI (large scale integrated) chips.

11) formant - an acoustic resonant frequency and its
associated bandwidth. Every speech sound has a set
of formants which are determined by the
configuration of the vocal tract.

12) fricative speech sounds - productions of speech
which are predominantly turbulent, since they have
a noise source at the place of articulation.
Examples of fricatives include the first sound in
each of the following words: fun, very, he.

13) isolated word recognizer - a recognizer that is
able to handle only single, or isolated, words
which are not embedded in phrases or sentences,
but are pronounced with boundaries of silence
surrounding them. This is an easier recognition
task than that of handling "connected speech"
which consists of words strung together to form
whole sentences, at a regular repetition rate.
Isolated word recognizers generally perform best
where input vocabulary items are pronounced
relatively slowly and precisely, both during
training and actual recognition.

14) keyword spotting - recognizing certain words of
specific interest in connected speech input. The
Coast Guard has seen keyword spotting of such
words as "mayday" or "fire" as a means of backing
up human operators who are assigned to monitor
radio receptions on distress frequencies.

15)  LPC  synthesis  -  a  type  of  speech  generation  or 
speech  synthesis  that  uses  a  "linear  prediction 
coding" model of speech. This approach of analysis
synthesis uses a digital filter to model the human
vocal tract; it is based upon the statistical
assumption that human speech changes relatively
slowly, and that it is possible to predict the
next  set  of  acoustic  measures  based  on  a  knowledge 
of  previous  ones. 
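
The idea of predicting the next value from previous ones
can be sketched with the autocorrelation method and the
Levinson-Durbin recursion, a common textbook formulation of
linear prediction. This is an illustration only; it is not
claimed to be the procedure used in any synthesizer
discussed in this report.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation values r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [sum(signal[i] * signal[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]

def lpc_coefficients(signal, order):
    """Levinson-Durbin recursion.

    Returns predictor coefficients a[1..order] such that signal[n]
    is approximated by sum(a[k] * signal[n - k] for k = 1..order).
    """
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)   # a[0] is unused
    error = r[0]              # prediction error energy so far
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / error
        new_a = a[:]
        new_a[i] = reflection
        for k in range(1, i):
            new_a[k] = a[k] - reflection * a[i - k]
        a = new_a
        error *= (1.0 - reflection ** 2)
    return a[1:]
```

Applied to a signal in which each sample is 0.9 times the
previous one, a first-order analysis recovers a coefficient
very close to 0.9. Storing a few such coefficients per
frame, rather than every sample, is what allows the low bit
rates associated with LPC synthesis.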

16) LSI chip - large scale integrated circuits which
are put into a single chip. The LSI chip industry
is a key part of speech technology. As the price
for such chips has decreased through volume
production methods, integrating speech technology
into new product areas has become more attractive.

17) nasal speech sounds - productions of speech which
are made with the air stream being emitted only
through the nose. Examples of nasals include the
first sound in the word mat and in the word nice.

18) nonperiodic sound - a sound which does not have a
waveform with a consistently repetitive rate. For
example, vowels are characterized by having
waveforms which are basically periodic in nature,
but plosives and fricatives do not have such
waveforms. These consonants have nonperiodic
waveforms of burst and turbulence. We have
noticed that nonperiodic sounds, such as pops and
clicks, were commonly found in Coast Guard radio
receptions. These nonperiodic sounds can
sometimes generate false recognitions when
mistakenly identified as plosives or fricatives.

19)  O.E.M.  -  original  equipment  manufacturers.  This 
includes  companies  that  integrate  commercially 
purchased  items,  such  as  LSI  chips,  into  their  own 
products  which  are  again  sold  as  a  finished 
product. 

20) performance analysis - a statistical analysis of
human or human/equipment performance. The basic
concept involves identifying variables which
significantly affect human or human/equipment
performance of a predefined task through multiple
regression and factor analysis. The term
performance analysis refers to actually deriving a
regression equation which describes human or
human/equipment interactions. Such an equation
would  be  helpful  by  identifying  variables  which 
significantly  affect  recognition  accuracy  rates. 
If  we  actually  had  a  performance  analysis  of 
speech  recognizers  which  gave  significant  results, 
we  might  be  able  to  identify  which  recognition 
algorithms  were  relatively  preferable  to  others. 
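
As a rough illustration of deriving such a regression equation, the
sketch below fits accuracy against two predictors by ordinary least
squares. Everything in it is assumed for the example: the predictors
(noise level and vocabulary size), the function name, and the data,
which are generated from a known linear rule and are not measurements
from any recognizer. Factor analysis is not shown.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows    : list of predictor tuples, one per observation
    targets : list of observed responses
    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Invented test conditions: (noise level, vocabulary size).
conditions = [(0, 10), (0, 50), (10, 10), (10, 50),
              (20, 10), (20, 50), (5, 30)]

# Synthetic accuracies produced by a made-up rule, so the fit can be checked.
accuracy = [98.0 - 0.8 * noise - 0.1 * vocab for noise, vocab in conditions]

intercept, noise_coef, vocab_coef = fit_linear(conditions, accuracy)
```

Because the data were generated from the rule itself, the fit simply
recovers the rule's coefficients. With real measurements, the size and
significance of each coefficient would indicate which variables matter
most to recognition accuracy.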


21) phonemes - basic units of sound in human speech,
sometimes referred to as the vowels and consonants
of the language. Phonemes are discrete sounds
which can cause a difference in meaning between
otherwise identical words, such as "bat" and "pat"
or "but" and "bit".

22)  phonetic  context  -  the  location  of  an  acoustic 
entity  with  reference  to  surrounding  sounds.  The 
acoustic  context  has  a  noticeable  effect  upon 
phonemes.  Linguists  commonly  refer  to  allophones 
which  are  pronunciation  variants  of  basic  phonemes 
due  in  part  to  their  phonetic  contexts.  Since  a 
given  phoneme  can  have  a  variety  of  allophones, 
this  makes  the  overall  recognition  task  more 
difficult,  particularly  when  attempting  to  develop 
a  speaker-independent  recognizer  which  will  handle 
connected speech with various allophones.

23) plosive speech sounds - productions of speech which
involve  a  blocking  or  stoppage  of  the  air  flow 
from  the  vocal  tract.  Examples  of  plosives 
include  the  first  sound  in  each  of  the  following 
words:  ii&l,  door,  give. 

24) prosodics - the "suprasegmentals" or influences of
duration, fundamental frequency, and speech
production power upon basic phonemes. These
influences include emphasis, stress, and
durational  patterns  of  the  vowels  and  consonants. 
The  prosodic  parameters  are  basic  to  both  the 
recognition  and  synthesis  of  speech. 

25) rapid speech - speech which is produced
considerably faster than carefully articulated
speech.

26) ROM - read only memory; a category of LSI chip that
is not erasable or programmable.

27) rule synthesis - a type of speech generation or
speech synthesis that uses a set of rules to model
speech. This approach specifies which
combinations of acoustic parameters are to be used
to best imitate human speech.

28) sampling rate - the frequency with which speech
recognizers  digitally  encode  speech  data.  Such 
digitized  data  are  actually  numerical  codings 
which  represent  the  translation  or  analysis  of  the 
real  speech  waves.  The  numerical  data  are  taken  at 
uniform  points  along  these  speech  waves  and 
expressed as a function of time. A common
sampling rate for laboratory analysis of speech
waves is 10,000 samples per second. Commercial
recognizers, however, tend toward a lower sampling
rate to conserve memory.

29) speaker dependent recognizer - a recognizer that
requires "training" by an individual before
recognition of that individual's speech can take
place. "Training" generally consists of having the
person whose vocabulary is to be input for
recognition repeat this vocabulary several times,
so that templates can be established for each
vocabulary item of a given individual. Such
templates are then used for comparison with
incoming vocabulary items.

30) speaker independent recognizer - a recognizer that
does not require "training" by individual speakers
before  recognition  of  that  individual's  speech  can 
take  place.  A  speaker  independent  recognizer 
allows  virtually  any  person  to  input  speech  with 
no  stored  information  about  that  speaker's 
characteristics  to  aid  the  machine  in  its 
recognition  of  the  speech. 

31) speech recognizer - a device that accepts speech as
an input and produces typed messages or action by
a machine controlled by the recognizer as output.
A speech recognizer can be digital or analog. It
can be one of two types: 1) isolated word, or 2)
connected or continuous speech.

32) speech synthesizer - a device that generates speech
mechanically. A speech synthesizer can be digital
or analog. It can be one of three types: 1)
analysis synthesizer, 2) rule synthesizer, or 3)
digital recoding synthesizer.

33) templates - acoustic manifestations of words or
utterances that have been stored in digital form.
Templates  are  actually  sets  of  numbers 
representing  acoustically  derived  parameters. 
When  training  speech  recognizers,  templates  are 
set  up  as  a  speaker  repeats  his/her  vocabulary 
items  into  the  recognizer.  Later,  these  templates 
are  used  as  references  for  statistical  comparison 
to  input  vocabulary  items  to  determine  the 
identity  of  the  input  vocabulary. 
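
The training and comparison steps described for templates can be
sketched as follows. This is a minimal illustration under stated
assumptions: each utterance is reduced to a fixed-length list of
numbers, a template is the average of the training repetitions, and
the comparison is a Euclidean distance. The function names and all
parameter values are invented for the example and do not describe any
commercial recognizer.

```python
import math


def train_template(repetitions):
    """Average several repetitions of one word into a single template."""
    count = len(repetitions)
    return [sum(values) / count for values in zip(*repetitions)]


def recognize(templates, features):
    """Return the vocabulary item whose template is nearest the input."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(templates, key=lambda word: distance(templates[word], features))


# Invented parameter sets: three repetitions of each vocabulary item.
training = {
    "mayday":  [[0.9, 0.2, 0.4], [1.0, 0.1, 0.5], [0.8, 0.3, 0.3]],
    "weather": [[0.2, 0.8, 0.7], [0.3, 0.9, 0.6], [0.1, 0.7, 0.8]],
}

# Training: one stored template per vocabulary item for this speaker.
templates = {word: train_template(reps) for word, reps in training.items()}

# Recognition: compare an incoming (invented) parameter set to each template.
guess = recognize(templates, [0.85, 0.25, 0.45])
```

Here the incoming item lies closest to the "mayday" template, so that
word is reported. A practical recognizer would also need a rejection
threshold so that an input far from every template is not forced onto
the nearest vocabulary item.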

34) voice recognition - either automatic recognition of
words and sentences which are spoken into a speech
recognizer or automatic recognition of the voice
quality of the speaker, thus serving as an
identification of the person who is speaking. In
the first case, "voice recognition" is synonymous
with "speech recognition" and in the second case
with "speaker recognition".

35) voiced speech sounds - productions of speech
involving a vibration of the true vocal folds.
Examples of voiced sounds include the first sound
in  each  of  the  following  words:  it.,  5.e#  b^g. 

36)  voiceless  speech  sounds  -  productions  of  speech  not 
involving  a  vibration  of  the  true  vocal  folds. 
Examples  of  voiceless  sounds  include  the  first 
sound in each of the following words: keep, saw.

APPENDIX B


DESCRIPTION  OF  ILS 

The ILS commands have been written to utilize
peripheral devices such as disk packs for data
storage and retrieval, the line printer for
listings, and the analog-to-digital and
digital-to-analog converters as means of
interfacing the analog representation of signals
with the digital representation. The means of
interaction with the system is a terminal which
has graphic as well as alphanumeric (text)
capabilities.

The  ILS  software  has  been  developed  as  a  set 
of  self-contained  program  modules  which  are 
utilized  serially.  Each  ILS  command  is  a  program 
module  which  executes  a  specific  task.  The  program 
modules  are  stored  on  disk  and  are  brought  into 
core  one  at  a  time  by  user  command.  Consequently, 
except  for  the  keyboard  monitoring  program,  the 
memory resources of ILS are only in demand when an
ILS  command  is  being  executed. 

Thus, ILS is not taking up memory space
during the time the user is just thinking,
examining a display, typing the next command, etc.
This is an important factor in multi-user systems
which may have memory limitations.

The critical provision for communication of
parameter values between program modules is made
possible by providing on disk an exclusive file
for each user. This file, conveniently designated
as the user's COMMON file, contains global system
parameters and it serves as a work area for
deposit and retrieval of information by all
commands executed by the user. The acquisition of
shared information is effected as each module
initiates its execution by reading the
disk-resident user's COMMON file. At the
conclusion of its execution each module then
rewrites back onto disk the updated version of the
user's COMMON file. In this way an ILS module can
operate on previous results and arguments passed
through the user's COMMON file by a preceding
module.
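
The read-execute-rewrite cycle around the COMMON file can be
sketched in a few lines. This is an illustration only: ILS itself is
written in FORTRAN, and the JSON file format, the file name, the two
module names, and the 20 ms frame length below are all assumptions
made for the example, not details of the actual system.

```python
import json
import os
import tempfile


def run_module(common_path, task):
    """Run one module: read the COMMON file, execute, write it back."""
    with open(common_path) as f:
        common = json.load(f)          # acquire shared parameters
    task(common)                       # the module's specific task
    with open(common_path, "w") as f:
        json.dump(common, f)           # leave updated values for the next module


def set_sampling_rate(common):
    """Hypothetical module: deposit a global parameter."""
    common["sampling_rate"] = 10000


def compute_frame_length(common):
    """Hypothetical module: use a value left by a preceding module."""
    # 20 ms analysis frame (assumed value)
    common["frame_samples"] = common["sampling_rate"] * 20 // 1000


# One exclusive COMMON file per user (hypothetical file name).
workdir = tempfile.mkdtemp()
common_path = os.path.join(workdir, "user1_common.json")
with open(common_path, "w") as f:
    json.dump({}, f)

# Modules are executed serially, one at a time, by user command.
run_module(common_path, set_sampling_rate)
run_module(common_path, compute_frame_length)

with open(common_path) as f:
    final_state = json.load(f)
```

Neither module calls the other. The second depends on the first only
through the value left in the file, which is why one module can be
replaced without touching the rest.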

Because of the modularity of the system, any
program module may be modified without affecting
the other modules. This feature also permits the
replacement or addition of program modules on disk
providing they are properly designed to be
compatible with the ILS conventions. Thus, each
user can have his own tailored ILS commands.

Principles of Operation

The  software  complement  of  the  Interactive 
Laboratory  System  has  been  designed  to  function 
entirely  under  the  control  of  the  computer's  own 
operating system. In this way the ILS modules take
advantage  of  existing  subroutines  and  file 
structures  available  within  the  computer 
processing  system. 

It may be helpful at the outset to describe
the memory organization of the computer system
very simply as having two working segments. One
segment is occupied by the computer operating
system (at all times) and the other is available
for the execution of programs entered by various
computer users. The computer operating system can
be described in most general terms as an operating
executive which allocates computer resources to a
set of users. This system itself consists of a
collection of programs and tables which are used
to control the flow of information processing
within the computer.


The  Interactive  Laboratory  System  is  an 
organized  collection  of  interrelated  but 
independent  program  modules.  These  disk-resident 
modules  are  independent  in  the  sense  that  each 
module  becomes  the  sole  occupant  of  the  user 
segment  of  core  when  called  out  by  user  command, 
and  thus  renders  a  solo  performance  as  far  as  the 
remaining  disk-bound  modules  are  concerned.  The 
interrelatedness  of  the  ILS  program  modules  is 
realized  through  the  passing  of  constants, 
variables, and arrays through each user's COMMON
file  from  one  successive  module  to  another. 

It  may  further  be  helpful  to  identify  the 
actual  nature  of  the  program  modules  as  they  are 
placed  on  disk.  The  modules  actually  consist  of 
files  of  binary  data  which  are  computer 
translations  into  machine  language  of  the  original 
FORTRAN  source  program  written  by  the  ILS 
programmers. When read into core, these files
become  operating  intelligence  in  executing  the 
objectives  written  into  a  module.  In  order  to  do 
this,  the  processing  system  of  the  computer  in 
effect  places  itself  at  the  disposal  of  the  ILS 
program  currently  resident  within  the  core  and 
implements its instructions.

LIST OF MANUFACTURERS

Manufacturers of Speech Recognition Products:

Centigram Corp. 155A Moffett Park Drive, Suite 108,
Sunnyvale, California 94086 (408) 734-3222

Heuristics,  Inc.  1285  Hammerwood  Avenue,  Sunnyvale, 
California  94086  (408)  734-8532 

Interstate Electronics Corp. (ATO subsidiary) 1001 E.
Ball Road, P.O. Box 3117, Anaheim, California
92805 (714) 635-7210

Nippon Electric Company, Ltd. NEC America, Inc. 532
Broadhollow Road, Melville, New York 11746 (516)
752-9700

Scott Instruments 815 North Elm, Denton, Texas 76201
(817) 387-9514

Threshold Technology, Inc. 1820 Underwood Place,
Delran,  New  Jersey  08075  (609)  461-9200 

Verbex Corp. (Exxon subsidiary) 2 Oak Park, Bedford,
Massachusetts 01730 (617) 275-5160

Voicetek P.O. Box 388, Goleta, California 93017 (805)
695-1954


Inc. 26046 Eden Landing Road, Suite 7, Hayward,
California 94545 (415) 785-8060


Manufacturers of Speech Synthesis Products:

Centigram Corp. 155A Moffett Park Drive, Suite 108,
Sunnyvale, California 94086 (408) 734-3227

General Instruments Corp. Microelectronics Division,
600 West John Street, Hicksville, New York 11902
(516) 733-3107

Interstate Electronics Corp. (ATO subsidiary) 1001 E.
Ball Road, P.O. Box 3117, Anaheim, California
92805 (714) 635-7210

Kurzweil Computer Products, Inc. 33 Cambridge Parkway,
Cambridge, Massachusetts 02142 (617) 864-4700

Maryland Computer Services, Inc. 502 Rock Spring
Avenue, Bel Air, Maryland 21014 (301) 838-8888

Mimic Electronics P.O. Box 921, Acton, Massachusetts
01720 (617) 263-2101

MSC. 1640 Monrovia, Costa Mesa, California 92627 (714)
642-2477

National Semiconductor Corp. 2900 Semiconductor Drive,
Santa Clara, California 95051 (408)

Percom Data Co., Inc. 211 North Kirby, Garland, Texas
75042 (214) 272-7421

Telesens orv  Speech  s ysteBs.  I_nc .  iuom  Hill  *ip« 

Palo  Alto,  California  94  704  (4  15)  49  7-2*'J6 

Instruments  9600  Comaerce  Park  Dnv«,  "  i  i  *  .*  i  i  1 , 
Houston,  Texas  770  36  (  71  3)  7^6-6511 

Votrax 500 Stephenson Highway, Troy, Michigan
(313) 588-7050


